Abstract



We present methods for online linear optimization that take advantage of benign (as opposed
to worst-case) sequences. Specifically, if the sequence encountered by the learner is described well
by a known "predictable process," the algorithms presented enjoy tighter bounds as compared
to the typical worst-case bounds. Additionally, the methods achieve the usual worst-case regret
bounds if the sequence is not benign. Our approach can be seen as a way of adding prior
knowledge about the sequence within the paradigm of online learning. The setting is shown to
encompass partial and side information. Variance and path-length bounds [11, 9] can be seen
as particular examples of online learning with simple predictable sequences.

We further extend our methods and results to include competing with a set of possible 
predictable processes (models), that is "learning" the predictable process itself concurrently 
with using it to obtain better regret guarantees. We show that such model selection is possible 
under various assumptions on the available feedback. Our results suggest a promising direction 
of further research with potential applications to stock market and time series prediction. 

1 Introduction 

No-regret methods are studied in a variety of fields, including learning theory, game theory, and 
information theory [ ]. These methods guarantee a certain level of performance in a sequential 
prediction problem, irrespective of the sequence being presented. While such "protection" against 
the worst case is often attractive, the bounds are naturally pessimistic. It is, therefore, desirable 
to develop algorithms that yield tighter bounds for "more regular" sequences, while still providing 
protection against worst-case sequences. Some successful results of this type have appeared in [8, 
11, 10, 9, 5] within the framework of prediction with expert advice and online convex optimization. 

In [17], a general game-theoretic formulation was put forth, with "regularity" of the sequence 
modeled as a set of restrictions on the possible moves of the adversary. Through a non-constructive 
theoretical analysis, the authors of [17] pointed to the existence of quite general regret-minimization 
strategies for benign sequences, but did not provide a computationally feasible method. In this 
paper, we present algorithms that achieve some of the regret bounds of [17] for sequences that can 
be roughly described as 



sequence = predictable process + adversarial noise 

This paper focuses on the setting of online linear optimization. The results achieved in the
full-information case carry over to online convex optimization as well. To remind the reader of the
setting, the online learning process is modeled as a repeated game with convex sets $\mathcal{F}$ and $\mathcal{X}$ for
the learner and Nature, respectively. At each round $t = 1, \ldots, T$, the learner chooses $f_t \in \mathcal{F}$ and
observes the move $x_t \in \mathcal{X}$ of Nature. The learner suffers a loss of $\langle f_t, x_t \rangle$ and the goal is to minimize
regret, defined as

$$\mathrm{Reg}_T = \sum_{t=1}^T \langle f_t, x_t \rangle - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \langle f, x_t \rangle .$$
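As a concrete illustration of the regret definition, the following minimal sketch evaluates the regret numerically. It assumes the learner's set is the Euclidean unit ball (an illustrative choice, not one made in the text), for which the infimum over fixed comparators equals minus the Euclidean norm of the summed moves. The function name `regret_unit_ball` is hypothetical.

```python
import numpy as np

def regret_unit_ball(plays, moves):
    """Regret of the plays f_1..f_T against moves x_1..x_T, assuming the
    learner's set is the Euclidean unit ball (illustrative assumption)."""
    plays = np.asarray(plays, dtype=float)
    moves = np.asarray(moves, dtype=float)
    learner_loss = np.sum(plays * moves)               # sum_t <f_t, x_t>
    best_fixed = -np.linalg.norm(moves.sum(axis=0))    # inf over the ball of <f, sum_t x_t>
    return learner_loss - best_fixed

# Example: Nature repeats the same move; a learner that never moves from the origin
# pays regret equal to the norm of the summed moves.
moves = [[1.0, 0.0], [1.0, 0.0]]
idle_regret = regret_unit_ball([[0.0, 0.0], [0.0, 0.0]], moves)
```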

There are a number of ways to model "more regular" sequences. Let us start with the following
definition. Fix a sequence of functions $M_t : \mathcal{X}^{t-1} \mapsto \mathcal{X}$, for each $t \in \{1, \ldots, T\} = [T]$. These functions
define a predictable process

$$M_1, \; M_2(x_1), \; \ldots, \; M_T(x_1, \ldots, x_{T-1}) .$$

If, in fact, $x_t = M_t(x_1, \ldots, x_{t-1})$ for all $t$, one may view the sequence $\{x_t\}$ as a (noiseless) time
series, or as an oblivious strategy of Nature. If we knew that the sequence given by Nature follows
exactly this evolution, we should suffer no regret.

Suppose that we have a hunch that the actual sequence will be "roughly" given by this predictable
process: $x_t \approx M_t(x_1, \ldots, x_{t-1})$. In other words, we suspect that the sequence is described as
predictable process plus adversarial noise. Can we use this fact to incur smaller regret if our
suspicion is correct? Ideally, we would like to "pay" only for the unpredictable part of the sequence.



Information-Theoretic Justification Let us spend a minute explaining why such regret bounds
are information-theoretically possible. The key is the following observation, made in [17]. The
non-constructive upper bounds on the minimax value of the online game involve a symmetrization step,
which we state for simplicity of notation for the linear loss case with $\mathcal{F}$ and $\mathcal{X}$ being dual unit balls:

$$\sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left\| \sum_{t=1}^T \epsilon_t (x'_t - x_t) \right\|_* \le 2 \sup_{x_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T} \mathbb{E}_{\epsilon_T} \left\| \sum_{t=1}^T \epsilon_t x_t \right\|_* .$$



If we instead only consider sequences such that at any time $t \in [T]$, $x_t$ and $x'_t$ have to be $\sigma_t$-close to
the predictable process $M_t(x_1, \ldots, x_{t-1})$, we can add and subtract the "center" $M_t$ on the left-hand
side of the above equation and obtain tighter bounds for free, irrespective of the form of
$M_t(x_1, \ldots, x_{t-1})$. To make this observation more precise, let

$$C_t = C_t(x_1, \ldots, x_{t-1}) = \{ x : \| x - M_t(x_1, \ldots, x_{t-1}) \|_* \le \sigma_t \} \qquad (1)$$

be the set of allowed deviations from the predictable "trend". We then have a bound

$$\sup_{x_1, x'_1 \in C_1} \mathbb{E}_{\epsilon_1} \ldots \sup_{x_T, x'_T \in C_T} \mathbb{E}_{\epsilon_T} \left\| \sum_{t=1}^T \epsilon_t \left( x'_t - M_t(x_1, \ldots, x_{t-1}) + M_t(x_1, \ldots, x_{t-1}) - x_t \right) \right\|_* \le c \sqrt{\sum_{t=1}^T \sigma_t^2}$$

on the value of the game against such "constrained" sequences, where the constant $c$ depends on
the smoothness of the norm. This short description only serves as a motivation, and the more
precise statements about the value of a game against constrained adversaries can be found in [17].

The development so far is a good example of how a purely theoretical observation can point to
the existence of better prediction methods. What is even more surprising, for most of the methods
presented below, the individual $\sigma_t$'s need not be known ahead of time except for their total sum
$\sum_{t=1}^T \sigma_t^2$. Moreover, the latter sum need not be known in advance either, thanks to the standard
doubling trick, and one can obtain upper bounds of

$$\sum_{t=1}^T \langle f_t, x_t \rangle - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \langle f, x_t \rangle \le c \sqrt{\sum_{t=1}^T \| x_t - M_t(x_1, \ldots, x_{t-1}) \|_*^2} \qquad (2)$$

on regret, for some problem-dependent constant $c$.

Let us now discuss several types of statistics $M_t$ that could be of interest.
Example 1. Regret bounds in terms of

$$M_t(x_1, \ldots, x_{t-1}) = x_{t-1}$$

are known as path length bounds [17, 9]. Such bounds can be tighter than the pessimistic $O(\sqrt{T})$
bounds when the previous move of Nature is a good proxy for the next move.

Regret bounds in terms of

$$M_t(x_1, \ldots, x_{t-1}) = \frac{1}{t-1} \sum_{s=1}^{t-1} x_s$$

are known as variance bounds [8, 10, 11, 17]. One may also consider fading memory statistics

$$M_t(x_1, \ldots, x_{t-1}) = \sum_{s=1}^{t-1} \alpha_s x_s , \qquad \sum_{s=1}^{t-1} \alpha_s = 1 , \quad \alpha_s \ge 0$$

or even plug in an auto-regressive model.

If "phases" are expected in the data (e.g., stocks tend to go up in January), one may consider

$$M_t(x_1, \ldots, x_{t-1}) = x_{t-k}$$

for some phase length $k$. Alternatively, one may consider averaging of the past occurrences $T_j(t) \subset
\{1, \ldots, t\}$ of the current phase $j$ to get a better predictive power:

$$M_t(x_1, \ldots, x_{t-1}) = \sum_{s \in T_j(t)} \alpha_s x_s .$$
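The statistics listed in this example are simple to compute from the history of Nature's moves. The sketch below is one possible rendering, not the authors' code: the function names are hypothetical, the geometric weights in `fading_memory` are an arbitrary illustrative choice of the coefficients, and each function assumes the history is long enough (the first-round prediction must be supplied separately).

```python
import numpy as np

def path_length(history):
    """M_t = x_{t-1}: predict that Nature repeats its previous move."""
    return history[-1]

def running_mean(history):
    """M_t = (1/(t-1)) * sum_{s<t} x_s: the statistic behind variance bounds."""
    return np.mean(history, axis=0)

def fading_memory(history, decay=0.5):
    """M_t = sum_s alpha_s x_s with alpha_s >= 0 summing to one.
    Geometric weights with ratio `decay` are an illustrative assumption."""
    ages = np.arange(len(history) - 1, -1, -1)   # age 0 is the most recent move
    alpha = decay ** ages
    alpha = alpha / alpha.sum()
    return alpha @ np.asarray(history)

def phase(history, k):
    """M_t = x_{t-k}: repeat the move seen one phase length k ago."""
    return history[-k]

# Three past moves x_1, x_2, x_3; each call returns a candidate M_4.
history = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
guesses = {
    "path": path_length(history),
    "mean": running_mean(history),
    "fading": fading_memory(history),
    "phase": phase(history, 2),
}
```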



Interpreting the Bounds The use of a predictable process $(M_t)_{t \ge 1}$ can be seen as a way of
incorporating prior knowledge about the sequence $(x_t)_{t \ge 1}$. Importantly, the bounds still provide
the usual worst-case protection if the process $M_t$ does not predict the sequence well. To see this,
observe that the bounds of the paper scale with $\sqrt{\sum_{t=1}^T \| x_t - M_t \|_*^2} \le 2 \max_{x \in \mathcal{X}} \| x \|_* \sqrt{T}$, which is only
a factor of 2 larger than the typical bounds. However, when the $M_t$'s do indeed predict the $x_t$'s well, we
have low regret, a property we get almost for free. Notice that in all our analysis the predictable
process $(M_t)_{t \ge 1}$ can be any arbitrary function of the past.

A More General Setting The predictable process $M_t$ has been written so far as a function
of $x_1, \ldots, x_{t-1}$, as we assumed the setting of full-information online linear optimization (that is,
$x_t$ is revealed to the learner after playing $f_t$). Whenever our algorithm is deterministic, we may
reconstruct the sequence $f_1, \ldots, f_t$ given the sequence $x_1, \ldots, x_{t-1}$, and thus no explicit dependence
of $M_t$ on the learner's moves is required. More generally, however, nothing prevents us from
defining the predictable process $M_t$ as a function

$$M_t(I_1, \ldots, I_{t-1}, f_1, \ldots, f_{t-1}, q_1, \ldots, q_{t-1}) \qquad (3)$$

where $I_s$ is the information conveyed to the learner at step $s \in [T]$ (defined on the appropriate
information space $\mathcal{I}$) and $q_s$ is the randomized strategy of the learner at time $s$. For instance, in
the well-studied bandit framework, the feedback $I_s$ is defined as the scalar value of the loss $\langle f_s, x_s \rangle$,
yet the actual move $x_s$ may remain unknown to the learner. More general partial information
structures have also been studied in the literature.

When $M_t$ is written in the form (3), it becomes clear that one can consider scenarios well beyond
the partial information models. For instance, the information $I_s$ might contain better or complete
information about the past, thus modeling a delayed-feedback setting (see Section 6.1). Another
idea is to consider a setting where $I_s$ contains some state information pertinent to the online learning
problem.

The paper is organized as follows. In Section 2, we provide a number of algorithms for
full-information online linear optimization, taking advantage of a given predictable process $M_t$. These
methods can be seen as being "optimistic" about the sequence, incorporating $M_{t+1}$ into the
calculation of the next decision as if it were true. We then turn to the partial information scenario in
Section 3 and show how to use the full-information bounds together with estimation of the missing
information. Along the way, we prove a bound for nonstochastic multiarmed bandits in terms of the
loss of the best arm - a result that does not appear to be available in the literature. In Section 4
we turn to the question of learning $M_t$ itself during the prediction process. We present several
scenarios which differ in the amount of feedback given to the learner. Finally, we consider delayed
feedback and some other scenarios that fall under the general umbrella.

Remark 1. We remark that most of the regret bounds we present in this paper are of the form
$A \eta^{-1} + B \eta \sum_{t=1}^T \| x_t - M_t \|_*^2$. If variation around the trend is known in advance, one may choose $\eta$
optimally to obtain the form in (2). Otherwise, we employ the standard doubling trick which we
provide for completeness in Section 8. The doubling trick sets $\eta$ in a data-dependent way to achieve
(2) with a slightly worse constant.
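
For the reader's convenience, here is the short calculation behind the optimal choice of the learning rate. It assumes that the total variation around the trend, abbreviated here as $V = \sum_{t=1}^T \| x_t - M_t \|_*^2 > 0$ (a shorthand introduced only for this calculation), is known in advance:

$$\frac{d}{d\eta} \left( A \eta^{-1} + B \eta V \right) = -A \eta^{-2} + B V = 0 \iff \eta = \sqrt{\frac{A}{B V}} , \qquad \left. A \eta^{-1} + B \eta V \right|_{\eta = \sqrt{A / (B V)}} = 2 \sqrt{A B V} .$$

The resulting bound is a constant multiple of $\sqrt{V}$, here with constant $2 \sqrt{A B}$.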

Notation: We use the notation $y_{t':t}$ to represent the sequence $y_{t'}, \ldots, y_t$. We also use the notation
$x[i]$ to represent the $i$-th element of vector $x$. We use the notation $x[1:c]$ to represent the
$c$-dimensional vector $(x[1], \ldots, x[c])$. $D_{\mathcal{R}}(f, f')$ is used to represent the Bregman divergence between
$f$ and $f'$ w.r.t. function $\mathcal{R}$. We denote the set $\{1, \ldots, T\}$ by $[T]$.

2 Full Information Methods 

In this section we assume that the value $M_t$ is known at the beginning of round $t$: it is either
calculated by the learner or conveyed by an external source. The first algorithm we present is a
modification of the Follow the Regularized Leader (FTRL) method with a self-concordant
regularizer. The advantage of this method is its simplicity and the close relationship to the standard
FTRL. Next, we exhibit a Mirror Descent type method which can be seen as a generalization of the 
recent algorithm of [ ]. Later in the paper (in Section 5) we also present full-information methods 
based on the idea of a random playout, developed in [ 5] for the problem of regret minimization in 
the worst case. To the best of our knowledge, these results are the first variation-style bounds for 
Follow the Perturbed Leader (FPL) algorithms. 

For all the methods presented below, we assume (without loss of generality) that $M_1 = 0$. Since we
assume that $M_t$ can be calculated from the information provided to the learner or the value of $M_t$
is conveyed from outside, we do not write the dependence of $M_t$ on the past explicitly.

2.1 Follow the Regularized Leader with Self-Concordant Barrier 

Let $\mathcal{F} \subset \mathbb{R}^d$ be a convex compact set and let $\mathcal{R}$ be a self-concordant function for this set. Without
loss of generality, suppose that $\min_{f \in \mathcal{F}} \mathcal{R}(f) = 0$. Given $f \in \mathrm{int}(\mathcal{F})$, define the local norm $\| \cdot \|_f$ with
respect to $\mathcal{R}$ by $\| g \|_f = \sqrt{g^\top (\nabla^2 \mathcal{R}(f)) g}$. The dual norm is then $\| x \|_f^* = \sqrt{x^\top (\nabla^2 \mathcal{R}(f))^{-1} x}$. Given
the $f_t$ defined in the algorithm below, we use the shorthand $\| \cdot \|_t = \| \cdot \|_{f_t}$.

Consider the following algorithm.

Optimistic Follow the Regularized Leader

Input: $\mathcal{R}$ self-concordant barrier, learning rate $\eta > 0$. Initialize $f_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
At $t = 1, \ldots, T$, predict $f_t$, observe $x_t$, and update

$$f_{t+1} = \arg\min_{f \in \mathcal{F}} \; \eta \left\langle f, \sum_{s=1}^t x_s + M_{t+1} \right\rangle + \mathcal{R}(f)$$



We notice that for $M_{t+1} = 0$, the method reduces to the Follow the Regularized Leader (FTRL)
algorithm of [1, 3]. When $M_{t+1} \ne 0$, the algorithm can be seen as "guessing" the next move and
incorporating it into the objective. If the guess turns out to be correct, the method should suffer
no regret, according to the "be the leader" analysis.
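
To make the update concrete, the following minimal sketch runs the optimistic FTRL rule in one dimension. It assumes the learner's set is the interval [-1, 1] and the barrier is R(f) = -log(1 - f) - log(1 + f); both are illustrative choices, not taken from the text. Under these assumptions the minimizer has a closed form obtained from the stationarity condition. All function names are hypothetical.

```python
import math

def oftrl_step(eta, cum_loss, guess):
    """argmin over f in (-1, 1) of eta * f * (cum_loss + guess) + R(f),
    where R(f) = -log(1 - f) - log(1 + f) is the assumed log barrier."""
    a = eta * (cum_loss + guess)
    # Stationarity: a + 2 f / (1 - f^2) = 0. The root in (-1, 1) is
    # (1 - sqrt(1 + a^2)) / a, written below in a numerically stable form.
    return -a / (1.0 + math.sqrt(1.0 + a * a))

def run_oftrl(xs, ms, eta):
    """xs[t] is Nature's move and ms[t] the guess for it (first guess is 0)."""
    cum, plays = 0.0, []
    for x, m in zip(xs, ms):
        plays.append(oftrl_step(eta, cum, m))  # uses past moves plus the guess
        cum += x                               # the move is revealed after playing
    return plays

def regret(plays, xs):
    best_fixed = -abs(sum(xs))  # inf over f in [-1, 1] of f * sum(xs)
    return sum(f * x for f, x in zip(plays, xs)) - best_fixed

# A constant sequence: guessing the next move correctly (from round 2 on)
# is compared with plain FTRL, which corresponds to guesses equal to zero.
xs = [1.0] * 50
plain = run_oftrl(xs, [0.0] * 50, eta=0.1)
optimistic = run_oftrl(xs, [0.0] + [1.0] * 49, eta=0.1)
```

On this constant sequence the optimistic plays move toward the best fixed decision one round earlier than the plain FTRL plays, so their cumulative loss is smaller.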

The following regret bound holds for the proposed algorithm: 

Lemma 1. Let $\mathcal{F} \subset \mathbb{R}^d$ be a convex compact set endowed with a self-concordant barrier $\mathcal{R}$ with
$\min_{f \in \mathcal{F}} \mathcal{R}(f) = 0$. For any strategy of Nature, the Optimistic FTRL algorithm yields, for any
$f^* \in \mathcal{F}$,

$$\sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \le \eta^{-1} \mathcal{R}(f^*) + 2 \eta \sum_{t=1}^T \left( \| x_t - M_t \|_t^* \right)^2 \qquad (4)$$

as long as $\eta \| x_t - M_t \|_t^* < 1/4$ for all $t$.

By the argument of [1, 3], at the expense of an additive constant in the regret, the comparator $f^*$
can be taken from a smaller set, at a distance $1/T$ from the boundary. For such an $f^*$, we have
$\mathcal{R}(f^*) \le \vartheta \log T$ where $\vartheta$ is a self-concordance parameter of $\mathcal{R}$.

2.2 Mirror-Descent algorithm

The next algorithm is a modification of a Mirror Descent (MD) method for regret minimization.
Let $\mathcal{R}$ be a 1-strongly convex function with respect to a norm $\| \cdot \|$, and let $D_{\mathcal{R}}(\cdot, \cdot)$ denote the
Bregman divergence with respect to $\mathcal{R}$. Let $\nabla \mathcal{R}^*$ be the inverse of the gradient mapping $\nabla \mathcal{R}$. Let
$\| \cdot \|_*$ be the norm dual to $\| \cdot \|$. We do not require $\mathcal{F}$ and $\mathcal{X}$ to be unit dual balls.

Optimistic Mirror Descent Algorithm

Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\| \cdot \|$, learning rate $\eta > 0$. Initialize $f_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
At $t = 1, \ldots, T$, predict $f_t$, update

$$f'_{t+1} = \nabla \mathcal{R}^* \left( -\eta \sum_{s=1}^t x_s - \eta M_{t+1} \right)$$

and project onto $\mathcal{F}$

$$f_{t+1} = \arg\min_{f \in \mathcal{F}} D_{\mathcal{R}}(f, f'_{t+1})$$



If $M_{t+1} = 0$, one recovers the Mirror Descent algorithm. To see this, recursively define

$$g_{t+1} = \nabla \mathcal{R}^* ( \nabla \mathcal{R}(g_t) - \eta x_t )$$

for all $t \ge 1$ and $g_1 = \arg\min_g \mathcal{R}(g)$. The projection of $g_{t+1}$ onto $\mathcal{F}$ with respect to $D_{\mathcal{R}}$ is precisely
the standard Mirror Descent algorithm. Now, since $\nabla \mathcal{R}(g_1) = 0$, unwinding the recursion we get
$g_{t+1} = \nabla \mathcal{R}^* ( -\eta \sum_{s=1}^t x_s )$. Hence, $f_{t+1}$ coincides with the Mirror Descent solution when $M_{t+1} = 0$.

Given the definition of $g_{t+1}$, the update for $f'_{t+1}$ can then be written as

$$f'_{t+1} = \nabla \mathcal{R}^* ( \nabla \mathcal{R}(g_{t+1}) - \eta M_{t+1} )$$

followed by a projection with respect to the Bregman divergence $D_{\mathcal{R}}$. It is not difficult to verify
that this gives the following equivalent form of the algorithm:

Optimistic Mirror Descent Algorithm (equivalent form)

Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\| \cdot \|$, learning rate $\eta > 0$
Initialize $f_1 = g_1 = \arg\min_g \mathcal{R}(g)$
At $t = 1, \ldots, T$, predict $f_t$ and update

$$g_{t+1} = \nabla \mathcal{R}^* ( \nabla \mathcal{R}(g_t) - \eta x_t )$$
$$f_{t+1} = \arg\min_{f \in \mathcal{F}} \; \eta \langle f, M_{t+1} \rangle + D_{\mathcal{R}}(f, g_{t+1})$$

We remark that the update for $g_{t+1}$ does not require the projection onto the set $\mathcal{F}$. If for some
reason projection is desirable, it takes the form

$$g_{t+1} = \arg\min_{g \in \mathcal{F}} \; \eta \langle g, x_t \rangle + D_{\mathcal{R}}(g, g_t) . \qquad (5)$$

Such a two-projection algorithm for the case $M_t = x_{t-1}$ has been exhibited recently in [9]. However,
the projection is unnecessary and appears to complicate the simple form and the interpretation of
the Optimistic Mirror Descent.
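
The equivalent form of Optimistic Mirror Descent becomes especially simple in the Euclidean case. The sketch below assumes the regularizer is half the squared Euclidean norm and the learner's set is the Euclidean unit ball (illustrative choices, not made in the text). Under these assumptions the gradient map and its inverse are the identity, the Bregman divergence is half the squared distance, and the optimistic step is a Euclidean projection of the unprojected iterate shifted by the guess. The function names are hypothetical.

```python
import numpy as np

def project_ball(v, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def optimistic_md(xs, ms, eta):
    """Rows of xs are Nature's moves; rows of ms are the guesses for them
    (first row zero). Returns the learner's plays, one row per round."""
    g = np.zeros(xs.shape[1])          # unprojected iterate, starts at argmin of R
    plays = []
    for x, m in zip(xs, ms):
        # argmin over the ball of eta*<f, m> + 0.5*||f - g||^2 is a projection.
        plays.append(project_ball(g - eta * m))
        g = g - eta * x                # secondary sequence: no projection needed
    return np.array(plays)

rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)   # moves on the unit sphere
ms = np.vstack([np.zeros(3), xs[:-1]])            # path-length guess: previous move
plays = optimistic_md(xs, ms, eta=0.1)
```

In this Euclidean case each play coincides with the projection of minus the learning rate times the sum of past moves plus the current guess, which is the first form of the algorithm.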

Lemma 2. Let $\mathcal{F}$ be a convex set in a Banach space $\mathcal{B}$ and $\mathcal{X}$ be a convex set in the dual space
$\mathcal{B}^*$. Let $\mathcal{R} : \mathcal{B} \mapsto \mathbb{R}$ be a 1-strongly convex function on $\mathcal{F}$ with respect to some norm $\| \cdot \|$. For any
strategy of Nature, the Optimistic Mirror Descent Algorithm (with or without projection for $g_{t+1}$)
yields, for any $f^* \in \mathcal{F}$,

$$\sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \le \eta^{-1} R_{\max}^2 + \frac{\eta}{2} \sum_{t=1}^T \| x_t - M_t \|_*^2$$

where $R_{\max}^2 = \max_{f \in \mathcal{F}} \mathcal{R}(f) - \min_{f \in \mathcal{F}} \mathcal{R}(f)$.

As mentioned before, the sum $\sum_{t=1}^T \| x_t - M_t \|_*^2$ need not be known in advance in order to set $\eta$, as
the usual doubling trick can be employed. Both the Optimistic MD and Optimistic FTRL work in
the setting of online convex optimization, where $x_t$'s are now gradients at the points chosen by the
learner. Last but not least, notice that if the sequence is not following the trend $M_t$ as we hoped it
would, we still obtain the same bounds as for the Mirror Descent (respectively, FTRL) algorithm,
up to a constant.



2.2.1 Local Norms for Exponential Weights 

For completeness, we also exhibit a bound in terms of local norms for the case of $\mathcal{F} \subset \mathbb{R}^d$ being the
probability simplex and $\mathcal{X}$ being the $\ell_\infty$ ball. In the case of bandit feedback, such bounds serve as
a stepping stone to building a strategy that explores according to the local geometry of the set [2].
Letting $\mathcal{R}(f) = \sum_{i=1}^d f(i) \log f(i) - 1$, the Mirror Descent algorithm corresponds to the well-known
Exponential Weights algorithm. We now show that one can also achieve a regret bound in terms
of local norms defined through the Hessian $\nabla^2 \mathcal{R}(f)$, which is simply $\mathrm{diag}(f(1)^{-1}, \ldots, f(d)^{-1})$. To
this end, let $\| g \|_t = \sqrt{g^\top \nabla^2 \mathcal{R}(f_t) g}$ and $\| x \|_t^* = \sqrt{x^\top \nabla^2 \mathcal{R}(f_t)^{-1} x}$.

Lemma 3. The Optimistic Mirror Descent on the probability simplex enjoys, for any $f^* \in \mathcal{F}$,

$$\sum_{t=1}^T \langle f_t - f^*, x_t \rangle \le 2 \eta \sum_{t=1}^T \left( \| x_t - M_t \|_t^* \right)^2 + \frac{\log d}{\eta}$$

as long as $\eta \| x_t - M_t \|_\infty \le 1/4$ at each step.
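
On the simplex with an entropic regularizer, the optimistic update can be written in closed form: the weight of each coordinate is proportional to the exponential of minus the learning rate times its cumulative loss plus the current guess. The sketch below is a minimal rendering under that reading; the function name and the toy two-arm sequence are hypothetical.

```python
import numpy as np

def optimistic_exp_weights(xs, ms, eta):
    """Rows of xs are loss vectors; rows of ms are the guesses for them
    (first row zero). Returns one probability vector per round."""
    cum = np.zeros(xs.shape[1])
    plays = []
    for x, m in zip(xs, ms):
        logits = -eta * (cum + m)            # past losses plus the guess
        w = np.exp(logits - logits.max())    # shift by the max for stability
        plays.append(w / w.sum())
        cum = cum + x
    return np.array(plays)

# Toy sequence: arm 0 always has loss 0, arm 1 always has loss 1.
xs = np.tile([0.0, 1.0], (10, 1))
plain = optimistic_exp_weights(xs, np.zeros_like(xs), eta=0.5)   # guesses set to zero
ms = np.vstack([np.zeros(2), xs[:-1]])                           # guess = previous losses
optimistic = optimistic_exp_weights(xs, ms, eta=0.5)
```

With guesses set to zero this is the usual Exponential Weights algorithm; with the guesses above, the mass on the better arm is larger in every round after the first.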



3 Methods for Partial and Bandit Information 

We now turn to the setting of partial information and provide a generic estimation procedure along
the lines of [10]. Here, we suppose that the learner receives only partial feedback $I_t$ which is simply
the loss $\langle f_t, x_t \rangle$ incurred at round $t$. Once again, we assume access to some predictable
process $M_t$. Note the generality of this framework: in some cases we might postulate that $M_t$ needs
to be calculated by the learner from the available information (which does not include the actual
moves $x_t$); in other cases, however, we may assume that some statistic $M_t$ (such as some partial
information about the past moves) is conveyed to the learner as side information from an external
source. For the methods we present, we simply assume availability of the value $M_t$.

As in Section 2.1, we assume access to a self-concordant function $\mathcal{R}$ for $\mathcal{F}$, with the
self-concordance parameter $\vartheta$. Following [1], at time $t$ we define$^1$ our randomized strategy $q_t$ to be a
uniform distribution on the eigenvectors of $\nabla^2 \mathcal{R}(h_t)$ where $h_t \in \mathcal{F}$ is given by a full-information
procedure as described below. The full-information procedure is simply Follow the Regularized Leader on
the estimated moves $\tilde{x}_1, \ldots, \tilde{x}_{t-1}$ constructed from the information $I_1, \ldots, I_{t-1}, f_1, \ldots, f_{t-1}, q_1, \ldots, q_{t-1}$,
with $I_s = \langle f_s, x_s \rangle$. The resulting algorithm, dubbed SCRiBLe in [3], is presented below for
completeness:



SCRiBLe [3, 1]

Input: $\eta > 0$, $\vartheta$-self-concordant $\mathcal{R}$. Define $h_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
At time $t = 1$ to $T$
    Let $\{\Lambda_1, \ldots, \Lambda_n\}$ and $\{\lambda_1, \ldots, \lambda_n\}$ be the eigenvectors and eigenvalues of $\nabla^2 \mathcal{R}(h_t)$.
    Choose $i_t$ uniformly at random from $\{1, \ldots, n\}$ and $\epsilon_t = \pm 1$ with probability 1/2.
    Predict $f_t = h_t + \epsilon_t \lambda_{i_t}^{-1/2} \Lambda_{i_t}$ and observe loss $\langle f_t, x_t \rangle$.
    Define $\tilde{x}_t := n \langle f_t, x_t \rangle \epsilon_t \lambda_{i_t}^{1/2} \Lambda_{i_t}$.
    Update

$$h_{t+1} = \arg\min_{h \in \mathcal{F}} \; \eta \left\langle h, \sum_{s=1}^t \tilde{x}_s \right\rangle + \mathcal{R}(h)$$

Hazan and Kale [10] observed that the above algorithm can be modified by adding and subtracting
an estimated mean of the adversarial moves at appropriate steps of the method. We use this idea
with a general process $M_t$:



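SCRiBLe for a Predictable Process

Input: $\eta > 0$, $\vartheta$-self-concordant $\mathcal{R}$. Define $h_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
At time $t = 1$ to $T$
    Let $\{\Lambda_1, \ldots, \Lambda_n\}$ and $\{\lambda_1, \ldots, \lambda_n\}$ be the eigenvectors and eigenvalues of $\nabla^2 \mathcal{R}(h_t)$.
    Choose $i_t$ uniformly at random from $\{1, \ldots, n\}$ and $\epsilon_t = \pm 1$ with probability 1/2.
    Predict $f_t = h_t + \epsilon_t \lambda_{i_t}^{-1/2} \Lambda_{i_t}$ and observe loss $\langle f_t, x_t \rangle$.
    Define $\tilde{x}_t := n \langle f_t, x_t - M_t \rangle \epsilon_t \lambda_{i_t}^{1/2} \Lambda_{i_t} + M_t$.
    Update

$$h_{t+1} = \arg\min_{h \in \mathcal{F}} \; \eta \left\langle h, \sum_{s=1}^t \tilde{x}_s + M_{t+1} \right\rangle + \mathcal{R}(h)$$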



The analysis of the method is based on the bounds for full information predictable processes Mt 
developed earlier, thus simplifying and generalizing the analysis of [10]. 
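
One fact underlying this reduction is that the one-point estimate built from the observed loss, the known guess, and the sampled eigen-direction is unbiased for Nature's move. The sketch below checks this numerically by averaging the estimate over all equally likely draws of the eigen-direction and the sign. The positive definite matrix stands in for the Hessian of the barrier at the current point; it and all names are hypothetical, chosen only for illustration.

```python
import numpy as np

def expected_estimate(hessian, h, x, m):
    """Average of the estimate over all 2n equally likely draws (i, eps).
    `hessian` is assumed symmetric positive definite."""
    lam, Lam = np.linalg.eigh(hessian)   # eigenvalues lam[i], eigenvectors Lam[:, i]
    n = len(lam)
    total = np.zeros(n)
    for i in range(n):
        for eps in (-1.0, 1.0):
            f = h + eps * lam[i] ** -0.5 * Lam[:, i]   # the point actually played
            gap = f @ (x - m)                          # observed loss minus <f, m>
            total += n * gap * eps * lam[i] ** 0.5 * Lam[:, i] + m
    return total / (2 * n)

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
H = B @ B.T + np.eye(4)                  # stand-in for the barrier's Hessian
h, x, m = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
mean_estimate = expected_estimate(H, h, x, m)
```

The average equals Nature's move whatever the guess is; a good guess only reduces the size of the loss gap and hence the variance of the estimate.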



$^1$ We caution the reader that the roles of $f_t$ and $x_t$ in [1, 10] are exactly the opposite. We decided to follow the
notation of [16, 15], where in the supervised learning case it is natural to view the move $f_t$ as a function.
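
Lemma 4. Suppose that $\mathcal{F}$ is contained in the $\ell_2$ ball of radius 1. The expected regret of the above
algorithm (SCRiBLe for a Predictable Process) is

$$\mathbb{E} \left[ \sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \right] \le \eta^{-1} \mathcal{R}(f^*) + 2 \eta n^2 \, \mathbb{E} \left[ \sum_{t=1}^T \left( \langle f_t, x_t - M_t \rangle \right)^2 \right] \qquad (6)$$
$$\le \eta^{-1} \mathcal{R}(f^*) + 2 \eta n^2 \sum_{t=1}^T \mathbb{E} \left[ \| x_t - M_t \|^2 \right]$$

Hence, for any full-information statistic $M'_t = M'_t(x_1, \ldots, x_{t-1})$,

$$\mathbb{E} \left[ \sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \right] \le \eta^{-1} \mathcal{R}(f^*) + 4 \eta n^2 \sum_{t=1}^T \mathbb{E} \left[ \| x_t - M'_t \|^2 \right] + 4 \eta n^2 \sum_{t=1}^T \mathbb{E} \left[ \| M_t - M'_t \|^2 \right] \qquad (7)$$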

Effectively, Hazan and Kale show in [10] that for the full-information statistic $M'_t(x_1, \ldots, x_{t-1}) =
\frac{1}{t-1} \sum_{s=1}^{t-1} x_s$, there is a way to construct $M_t = M_t(I_1, \ldots, I_{t-1}, f_1, \ldots, f_{t-1}, q_1, \ldots, q_{t-1})$ in such a way
that the third term in (7) is of the order of the second term. This is done by putting aside roughly
$O(\log T)$ rounds in order to estimate $M'_t$, via a process called reservoir sampling. However, for
more general functions $M'_t$, the third term might have nothing to do with the second term, and the
investigation of which $M'_t$ can be well estimated by $M_t$ is an interesting topic of further research.

4 Learning The Predictable Processes 

So far we have seen that the learner with access to an arbitrary predictable process $(M_t)_{t \ge 1}$
has a strategy that suffers regret of $O\left( \sqrt{\sum_{t=1}^T \| x_t - M_t \|_*^2} \right)$. Now if the predictable process is a good predictor of
the sequence, then the regret will be low. This raises the question of model selection: how can the
learner choose a good predictable process $(M_t)_{t \ge 1}$? Is it possible to learn it online as we go, and if
so, what does it mean to learn?

To formalize the concept of learning the predictable process, let us consider the case where we have
a set $\Pi$ indexing a set of predictable processes (strategies) we are interested in. That is, each $\pi \in \Pi$
corresponds to a predictable process given by $(M_t^\pi)_{t \ge 1}$. Now if we had an oracle which at the start
of the game told us which $\pi^* \in \Pi$ predicts the sequence optimally (in hindsight) then we could use
the predictable process given by $(M_t^{\pi^*})_{t \ge 1}$ and enjoy a regret bound of

$$O \left( \sqrt{ \inf_{\pi \in \Pi} \sum_{t=1}^T \| x_t - M_t^\pi \|_*^2 } \right) .$$

However, we cannot expect to know which $\pi \in \Pi$ is the optimal one from the outset. In this scenario
one would like to learn a predictable process that in turn can be used with the algorithms proposed
thus far to get a regret bound comparable with the regret bound one could have obtained knowing the
optimal $\pi^* \in \Pi$.



4.1 Learning $M_t$'s: Full Information

To motivate this setting better let us consider an example. Say there are $n$ stock options we can
choose to invest in. On each day $t$, associated with each stock option one has a loss/payoff that
occurs upon investing in a single share of that stock. Our goal in the long run is to have a low regret
with respect to the single best stock in hindsight. Up to this point, the problem just corresponds
to the simple experts setting where each of the $n$ stocks is one expert and on each day we split our
investment according to a probability distribution over the $n$ options. However, now additionally
we allow the learner/investor access to prediction models from the set $\Pi$. These could be human
strategists making forecasts, or outcomes of some hedge-fund model. At each time step the learner
can query the prediction made by each $\pi \in \Pi$ as to what the loss on the $n$ stocks would be on that
day. Now we would like to have a regret comparable to the regret we can achieve knowing the best
model $\pi^* \in \Pi$ that in hindsight predicted the losses of each stock optimally. We shall now see how
to achieve this.



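Optimistic Mirror Descent Algorithm with Learning the Predictable Process

Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\| \cdot \|$, learning rate $\eta > 0$
Initialize $f_1 = g_1 = \arg\min_g \mathcal{R}(g)$ and initialize $q_1 \in \Delta(\Pi)$ as, $\forall \pi \in \Pi$, $q_1(\pi) = \frac{1}{|\Pi|}$
Set $M_1 = \sum_{\pi \in \Pi} q_1(\pi) M_1^\pi$
At $t = 1, \ldots, T$, predict $f_t$, observe $x_t$ and update

$$\forall \pi \in \Pi, \quad q_{t+1}(\pi) \propto q_t(\pi) \, e^{-\| M_t^\pi - x_t \|_*^2} \quad \text{and} \quad M_{t+1} = \sum_{\pi \in \Pi} q_{t+1}(\pi) M_{t+1}^\pi$$
$$g_{t+1} = \nabla \mathcal{R}^* ( \nabla \mathcal{R}(g_t) - \eta x_t )$$
$$f_{t+1} = \arg\min_{f \in \mathcal{F}} \; \eta \langle f, M_{t+1} \rangle + D_{\mathcal{R}}(f, g_{t+1})$$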



The proof of the following lemma relies on a particular regret bound of [7, Corollary 2.3] for the 
exponential weights algorithm that is in terms of the loss of the best arm. Such a bound is an 
improvement over the pessimistic regret bound when the loss of the optimal arm is small. 

Lemma 5. Let $\mathcal{F}$ be a convex subset of a unit ball in a Banach space $\mathcal{B}$ and $\mathcal{X}$ be a convex subset
of the dual unit ball. Let $\mathcal{R} : \mathcal{B} \mapsto \mathbb{R}$ be a 1-strongly convex function on $\mathcal{F}$ with respect to some norm
$\| \cdot \|$. For any strategy of Nature, the Optimistic Mirror Descent Algorithm yields, for any $f^* \in \mathcal{F}$,

$$\sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \le \eta^{-1} R_{\max}^2 + 3.2 \, \eta \left( \inf_{\pi \in \Pi} \sum_{t=1}^T \| x_t - M_t^\pi \|_*^2 + \log |\Pi| \right)$$

where $R_{\max}^2 = \max_{f \in \mathcal{F}} \mathcal{R}(f) - \min_{f \in \mathcal{F}} \mathcal{R}(f)$.

Once again, let us discuss what makes this setting different from the usual setting of experts. The
forecast given by prediction models is in the form of a vector, one for each stock. If we treat each
prediction model as an expert with the loss $\| x_t - M_t^\pi \|_*^2$, the experts algorithm would guarantee that
we achieve the best cumulative loss of this kind. However, this is not the object of interest to us, as
we are after the best allocation of our money among the stocks, as measured by $\inf_{f \in \mathcal{F}} \sum_{t=1}^T \langle f, x_t \rangle$.

The algorithm can be seen as separating two steps: learning the model (that is, predictable process)
and then minimizing regret given the learned process. This is implemented by a general idea of
running another (secondary) regret minimizing strategy where loss per round is simply $\| M_t - x_t \|_*^2$
and regret is considered with respect to the best $\pi \in \Pi$. That is, regret of the secondary regret
minimizing game is given by

$$\sum_{t=1}^T \| x_t - M_t \|_*^2 - \inf_{\pi \in \Pi} \sum_{t=1}^T \| x_t - M_t^\pi \|_*^2 .$$

In general, the experts algorithm for minimizing secondary regret can be replaced by any other
online learning algorithm.
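
The secondary game can be sketched in a few lines. The code below assumes the model weights are updated multiplicatively by the exponential of minus each model's squared prediction error, and it uses the squared Euclidean norm as the error measure; both are assumptions made for illustration, as are the function name and the toy data.

```python
import numpy as np

def mix_predictions(q, preds_now, x, preds_next):
    """One round of the secondary game.
    q: current weights over the models; preds_now[k] is model k's guess for
    the move x just revealed; preds_next[k] is its guess for the next move."""
    losses = np.sum((preds_now - x) ** 2, axis=1)   # squared error per model (Euclidean, assumed)
    q_next = q * np.exp(-losses)                    # exponential weights update
    q_next /= q_next.sum()
    return q_next, q_next @ preds_next              # new weights and the mixed guess

# Two models; the first one predicted the revealed move exactly.
q = np.array([0.5, 0.5])
preds_now = np.array([[1.0, 0.0], [0.0, 1.0]])
preds_next = np.array([[1.0, 0.0], [0.0, 1.0]])
q_next, m_next = mix_predictions(q, preds_now, np.array([1.0, 0.0]), preds_next)
```

The mixed guess returned here is what would be passed to the primary algorithm (Optimistic Mirror Descent) as its prediction for the next round.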



4.2 Learning $M_t$'s: Partial Information

In the previous section we considered the full information setting where on each round we have
access to $x_t$ and for each $\pi$ we get to see (or compute) $M_t^\pi$. However, one might be in a scenario
with only partial access to $x_t$ or $M_t^\pi$, or both. In fact, there are quite a number of interesting
partial-information scenarios, and we consider some of them in this section.

4.2.1 Partial Information about Loss (Bandit Setting)

In this setting at each time step $t$, we only observe the loss $\langle f_t, x_t \rangle$ and not all of $x_t$. However,
for each $\pi \in \Pi$ we do get access to (or can compute) $M_t^\pi$. Consider the following
algorithm:



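SCRiBLe while Learning the Predictable Process

Input: $\eta > 0$, $\vartheta$-self-concordant $\mathcal{R}$. Define $h_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
Initialize $q_1 \in \Delta(\Pi)$ as, $\forall \pi \in \Pi$, $q_1(\pi) = \frac{1}{|\Pi|}$
Set $M_1 = \sum_{\pi \in \Pi} q_1(\pi) M_1^\pi$
At time $t = 1$ to $T$
    Let $\{\Lambda_1, \ldots, \Lambda_n\}$ and $\{\lambda_1, \ldots, \lambda_n\}$ be the eigenvectors and eigenvalues of $\nabla^2 \mathcal{R}(h_t)$.
    Choose $i_t$ uniformly at random from $\{1, \ldots, n\}$ and $\epsilon_t = \pm 1$ with probability 1/2.
    Predict $f_t = h_t + \epsilon_t \lambda_{i_t}^{-1/2} \Lambda_{i_t}$ and observe loss $\langle f_t, x_t \rangle$.
    Define $\tilde{x}_t := n \langle f_t, x_t - M_t \rangle \epsilon_t \lambda_{i_t}^{1/2} \Lambda_{i_t} + M_t$.
    Update

$$\forall \pi \in \Pi, \quad q_{t+1}(\pi) \propto q_t(\pi) \, e^{-\left( \langle f_t, x_t \rangle - \langle f_t, M_t^\pi \rangle \right)^2} \quad \text{and} \quad M_{t+1} = \sum_{\pi \in \Pi} q_{t+1}(\pi) M_{t+1}^\pi$$
$$h_{t+1} = \arg\min_{h \in \mathcal{F}} \; \eta \left\langle h, \sum_{s=1}^t \tilde{x}_s + M_{t+1} \right\rangle + \mathcal{R}(h)$$

The following lemma upper bounds the regret of this algorithm. The proof once again uses a regret
bound in terms of the loss of the best arm [7, Corollary 2.3].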
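
Lemma 6. Suppose that $\mathcal{F}, \mathcal{X}$ are contained in the $\ell_2$ ball of radius 1. The expected regret of
SCRiBLe while Learning the Predictable Process is

$$\mathbb{E} \left[ \sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle \right] \le \eta^{-1} \mathcal{R}(f^*) + 2 \eta n^2 \, \mathbb{E} \left[ \sum_{t=1}^T \left( \langle f_t, x_t - M_t \rangle \right)^2 \right] \qquad (8)$$

and the second term is in turn bounded, up to a constant multiple of $\eta n^2$, by

$$\inf_{\pi \in \Pi} \sum_{t=1}^T \| x_t - M_t^\pi \|^2 + \log |\Pi| .$$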



4.2.2 Partial Information about Predictable Process 



Now let us consider the scenario where on each round we get to see $x_t \in \mathcal{X}$. However, we only
see $M_t^{\pi_t}$ for a single $\pi_t \in \Pi$ we select on round $t$. This scenario is especially useful in the stock
investment example provided earlier. While $x_t$, the vector of losses for the stocks on each day, can
easily be obtained at the end of the trading day, prediction processes might be provided as paid
services by various companies. Therefore, we only get to access a limited number of forecasts on
each day by paying for them. In this section, we provide an algorithm with a corresponding regret
bound for this case.



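Optimistic MD with Learning the Predictable Processes with Partial Information

Input: $\mathcal{R}$ 1-strongly convex w.r.t. $\| \cdot \|$, learning rate $\eta > 0$
Initialize $g_1 = \arg\min_g \mathcal{R}(g)$ and initialize $q_1 \in \Delta(\Pi)$ as, $\forall \pi \in \Pi$, $q_1(\pi) = \frac{1}{|\Pi|}$
Sample $\pi_1 \sim q_1$ and set $f_1 = \arg\min_{f \in \mathcal{F}} \eta \langle f, M_1^{\pi_1} \rangle + D_{\mathcal{R}}(f, g_1)$
At $t = 1, \ldots, T$, predict $f_t$ and:
    Update $q_t$ using SCRiBLe for multi-armed bandit with loss of arm $\pi_t$: $\| M_t^{\pi_t} - x_t \|_*^2$
    and step-size $1 / (32 |\Pi|^2)$.
    Sample $\pi_{t+1} \sim q_{t+1}$ and observe $M_{t+1}^{\pi_{t+1}}$
    Update

$$g_{t+1} = \nabla \mathcal{R}^* ( \nabla \mathcal{R}(g_t) - \eta x_t )$$
$$f_{t+1} = \arg\min_{f \in \mathcal{F}} \; \eta \langle f, M_{t+1}^{\pi_{t+1}} \rangle + D_{\mathcal{R}}(f, g_{t+1})$$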



Due to the limited information about the predictable processes, the proofs of Lemmas 7 and 8 
below rely on an improved regret bound for the multiarmed bandit, an analogue of [7, Corollary 
2.3]. Such a bound is proved in Lemma 13 in Section 7. 

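Lemma 7. Let $\mathcal{F}$ be a convex set in a Banach space $\mathcal{B}$ and $\mathcal{X}$ be a convex set in the dual space
$\mathcal{B}^*$, both contained in unit balls. Let $\mathcal{R} : \mathcal{B} \mapsto \mathbb{R}$ be a 1-strongly convex function on $\mathcal{F}$ with respect
to some norm $\| \cdot \|$. For any strategy of Nature, the Optimistic MD with Learning the Predictable
Processes with Partial Information Algorithm yields, for any $f^* \in \mathcal{F}$,

$$\mathbb{E} \left[ \sum_{t=1}^T \langle f_t, x_t \rangle \right] - \sum_{t=1}^T \langle f^*, x_t \rangle \le \eta^{-1} R_{\max}^2 + \frac{\eta}{2} \, \mathbb{E} \left[ \sum_{t=1}^T \| x_t - M_t^{\pi_t} \|_*^2 \right] \qquad (9)$$
$$\le \eta^{-1} R_{\max}^2 + \frac{\eta}{2} \left( e \inf_{\pi \in \Pi} \sum_{t=1}^T \| x_t - M_t^\pi \|_*^2 + 32 |\Pi|^3 \log(T |\Pi|) \right)$$

where $R_{\max}^2 = \max_{f \in \mathcal{F}} \mathcal{R}(f) - \min_{f \in \mathcal{F}} \mathcal{R}(f)$.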



4.2.3 Partial Information about both Loss and Predictable Process 

In the third partial information variant, we consider the setting where at time $t$ we only observe
the loss $\langle f_t, x_t \rangle$ we suffer at that time step (and not the entire $x_t$) and also only $M_t^{\pi_t}$ corresponding to the
predictable process of the $\pi_t \in \Pi$ we select at time $t$. This is a blend of the two partial-information
settings considered earlier.



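SCRiBLe for Learning the Predictable Process with Partial Feedback

Input: $\eta > 0$, $\vartheta$-self-concordant $\mathcal{R}$. Define $h_1 = \arg\min_{f \in \mathcal{F}} \mathcal{R}(f)$.
Initialize $q_1 \in \Delta(\Pi)$ as, $\forall \pi \in \Pi$, $q_1(\pi) = \frac{1}{|\Pi|}$
Draw $\pi_1 \sim q_1$
At time $t = 1$ to $T$
    Let $\{\Lambda_1, \ldots, \Lambda_n\}$ and $\{\lambda_1, \ldots, \lambda_n\}$ be the eigenvectors and eigenvalues of $\nabla^2 \mathcal{R}(h_t)$.
    Choose $i_t$ uniformly at random from $\{1, \ldots, n\}$ and $\epsilon_t = \pm 1$ with probability 1/2.
    Predict $f_t = h_t + \epsilon_t \lambda_{i_t}^{-1/2} \Lambda_{i_t}$ and observe loss $\langle f_t, x_t \rangle$.
    Define $\tilde{x}_t := n \langle f_t, x_t - M_t^{\pi_t} \rangle \epsilon_t \lambda_{i_t}^{1/2} \Lambda_{i_t} + M_t^{\pi_t}$.
    Update $q_t$ using SCRiBLe for multi-armed bandit with loss
    of arm $\pi_t \in \Pi$: $\left( \langle f_t, x_t \rangle - \langle f_t, M_t^{\pi_t} \rangle \right)^2$ and step size $1 / (32 |\Pi|^2)$.
    Draw $\pi_{t+1} \sim q_{t+1}$ and update

$$h_{t+1} = \arg\min_{h \in \mathcal{F}} \; \eta \left\langle h, \sum_{s=1}^t \tilde{x}_s + M_{t+1}^{\pi_{t+1}} \right\rangle + \mathcal{R}(h)$$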



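Lemma 8. Suppose that $\mathcal{F}, \mathcal{X}$ are contained in the $\ell_2$ ball of radius 1. The expected regret of
SCRiBLe for Learning the Predictable Process with Partial Feedback is

$$\mathbb{E}\left[\sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle\right] \le \eta^{-1} \mathcal{R}(f^*) + 2 \eta n^2\, \mathbb{E}\left[\sum_{t=1}^T \left( \langle f_t, x_t - M_t^{\pi_t} \rangle \right)^2\right] \tag{10}$$

$$\le \eta^{-1} \mathcal{R}(f^*) + 4 \eta n^2 \left( \inf_{\pi \in \Pi} \sum_{t=1}^T \|x_t - M_t^{\pi}\|^2 + 32 |\Pi|^3 \log(T|\Pi|) \right)$$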
5 Randomized Methods and the Follow the Perturbed Leader Algorithm



In this section we are back in the setting of Section 2, where the single process $M_t$ can be calculated
by the learner. We show that randomized methods of the Follow the Perturbed Leader (FPL) style
[12, 7] can also enjoy better bounds for predictable sequences. For convenience, we suppose $\mathcal{F} \subset \mathbb{R}^d$
is a unit ball in some norm $\|\cdot\|$, and $\mathcal{X}$ is a unit ball in the dual norm $\|\cdot\|_*$.

The central object in the algorithmic development of [15] is the notion of a relaxation. We now
present this notion in the context of a constrained adversary [ ] in order to develop randomized
methods that attain bounds in terms of the sizes $\sigma_t$ of deviations from the trend $M_t$. The downside
of the methods we present in this section is that the individual deviations $\sigma_t$ need to be known in
advance by the learner. We believe that this requirement can be relaxed, and this will be added in
the full version of this paper.

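A relaxation $\mathbf{Rel}$ is a sequence of functions $\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t)$ for each $t \in [T]$. We shall use the
notation $\mathrm{Rel}_T(\mathcal{F})$ for $\mathrm{Rel}_T(\mathcal{F} \mid \{\})$. For the problem of a constrained sequence, with constraints
given by the sequence $C_1, \ldots, C_T$ (see Eq. (1)), a relaxation will be called admissible if for any
$x_1, \ldots, x_t \in \mathcal{X}$,

$$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) \ge \inf_{q \in \Delta(\mathcal{F})} \sup_{x \in C_{t+1}(x_1, \ldots, x_t)} \left\{ \mathbb{E}_{f \sim q} \langle f, x \rangle + \mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t, x) \right\} \tag{11}$$

for all $t \in [T-1]$, and

$$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_T) \ge - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \langle f, x_t \rangle .$$

If $C_{t+1}(x_1, \ldots, x_t) = \mathcal{X}$ for all $t \in [T]$, we recover the setting of an unconstrained adversary studied
in [15].

Any choice $q$ that ensures (11) for an admissible relaxation guarantees (irrespective of the strategy
of the adversary) that

$$\sum_{t=1}^T \mathbb{E}_{f_t \sim q_t} \langle f_t, x_t \rangle - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \langle f, x_t \rangle \le \mathrm{Rel}_T(\mathcal{F}) , \tag{12}$$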



a fact that is easy to prove. It is shown in [15] that for many problems of interest, when searching 
for a computationally feasible relaxation, one may start with the conditional sequential Rademacher 
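complexity and find a computationally attractive upper bound. For the case of constrained adversaries, this complexity becomes (for the case of $\mathcal{F}$ being a unit ball)

$$\sup_{x_{t+1} \in C_{t+1}(x_{1:t})} \mathbb{E}_{\epsilon_{t+1}} \cdots \sup_{x_T \in C_T(x_{1:T-1})} \mathbb{E}_{\epsilon_T} \left\| 2 \sum_{s=t+1}^{T} \epsilon_s \left( x_s - M_s(x_{1:s-1}) \right) - \sum_{s=1}^{t} x_s \right\|_* \tag{13}$$

which can be re-written as

$$\sup_{z_{t+1} : \|z_{t+1}\|_* \le \sigma_{t+1}} \mathbb{E}_{\epsilon_{t+1}} \cdots \sup_{z_T : \|z_T\|_* \le \sigma_T} \mathbb{E}_{\epsilon_T} \left\| 2 \sum_{s=t+1}^{T} \epsilon_s z_s - \sum_{s=1}^{t} x_s \right\|_* \tag{14}$$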
Here, one may think of the adversary as choosing the $z_t$'s as small deviations from the predictable
process $M_t$. The following step is a key idea: since the computation of the interleaved supremum
and expectations is difficult, we might be able to come up with an almost-as-difficult distribution
and draw the $z_t$'s i.i.d. The following is an assumption that is easily verified for many symmetric
distributions [15].

Assumption 1. For every $t \in [T]$, there exists a distribution $D_t$ and constant $C \ge 2$ such that for any $w \in \mathbb{R}^d$,

$$\sup_{z : \|z\|_* \le \sigma_t} \mathbb{E}_{\epsilon} \|w + 2 \epsilon z\|_* \le \mathbb{E}_{z \sim D_t} \mathbb{E}_{\epsilon} \|w + C \epsilon z\|_* \tag{15}$$

and $\mathbb{E}_{z \sim D_t} \|z\|_*^2 \le \sigma_t^2$ for any $t$.

To satisfy this assumption, one may simply take one of the distributions in [ ] for the unconstrained
case, and scale it by $\sigma_t$.
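As a sanity check, inequality (15) can be verified numerically in the simplest possible setting. The sketch below is an illustration under assumptions that are not taken from the paper: the space is one-dimensional with the absolute value as the norm, $D_t$ is uniform on $\{-\sigma_t, +\sigma_t\}$, and $C = 2$. The function names `lhs` and `rhs` are hypothetical, and the supremum is approximated on a grid that contains both endpoints.

```python
def lhs(w, sigma, grid=2001):
    """Grid approximation of sup over |z| <= sigma of E_eps |w + 2*eps*z|,
    with eps uniform on {-1, +1}."""
    best = 0.0
    for k in range(grid):
        z = -sigma + 2.0 * sigma * k / (grid - 1)
        best = max(best, 0.5 * (abs(w + 2 * z) + abs(w - 2 * z)))
    return best


def rhs(w, sigma, C=2.0):
    """E_{z ~ D} E_eps |w + C*eps*z| for D uniform on {-sigma, +sigma}."""
    total = 0.0
    for z in (-sigma, sigma):
        # probability 1/2 for z, and 1/2 for each sign eps
        total += 0.5 * 0.5 * (abs(w + C * z) + abs(w - C * z))
    return total
```

In this one-dimensional case the map $z \mapsto \mathbb{E}_\epsilon |w + 2\epsilon z|$ is convex and even, so its supremum over $|z| \le \sigma_t$ is attained at $z = \pm\sigma_t$, and the two sides coincide for $C = 2$.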



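Lemma 9. For the distributions $D_1, \ldots, D_T$ satisfying Assumption 1, the relaxation

$$\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathbb{E}_{z_{t+1} \sim D_{t+1}, \ldots, z_T \sim D_T}\, \mathbb{E}_{\epsilon} \left\| C \sum_{i=t+1}^{T} \epsilon_i z_i - \sum_{i=1}^{t} x_i \right\|_* \tag{16}$$

is admissible and a randomized strategy that ensures admissibility is given by: at time $t$, draw
$z_{t+1}, \ldots, z_T$ and Rademacher random variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$, and then define

$$f_t = \arg\min_{g \in \mathcal{F}} \sup_{x_t \in C_t(x_{1:t-1})} \left\{ \langle g, x_t \rangle + \left\| C \sum_{i=t+1}^{T} \epsilon_i z_i - \sum_{i=1}^{t-1} x_i - x_t \right\|_* \right\} \tag{17}$$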



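The expected regret for the method is bounded by the classical Rademacher complexity

$$\mathbb{E}\, \mathrm{Reg}_T \le C\, \mathbb{E}_{z_{1:T}} \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^{T} \epsilon_t z_t \right\|_*$$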



where each random variable zt has distribution Dt. For any smooth norm, the expected regret can 



be further upper bounded by O 
Let us define the random vector

$$R_t := \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i z_i + M_t$$

where the first sum is the cumulative cost vector, the second sum may be viewed as a random
perturbation of the cumulative cost, and the final term is simply the predictable process at time $t$.
We may rewrite (17) as

$$f_t = \arg\min_{f \in \mathcal{F}} \sup_{x_t \in C_t(x_{1:t-1})} \left\{ \langle f, x_t \rangle + \|R_t + x_t - M_t\|_* \right\} = \arg\min_{f \in \mathcal{F}} \sup_{z : \|z\|_* \le \sigma_t} \left\{ \langle f, z + M_t \rangle + \|R_t + z\|_* \right\} \tag{18}$$



This is a general form of the randomized method for online linear optimization. As shown in [15], 
this form in fact reduces to the more familiar form of the FPL update in certain cases. 
5.1 Randomized Algorithm for the $\ell_1/\ell_\infty$ Case

We now show that for the case of $\mathcal{F}$ being an $\ell_1$ ball and $\mathcal{X}$ being an $\ell_\infty$ ball, the solution in
(18) takes on a simpler form. In particular, for $M_t = 0$ the solution is simply an indicator on the
maximum coordinate of $R_t$, which is precisely the Follow the Perturbed Leader solution.

Theorem 10. For the distributions $D_1, \ldots, D_T$ satisfying Assumption 1, consider the randomized
strategy that at time $t$ draws $z_{t+1}, \ldots, z_T$ from $D_{t+1}, \ldots, D_T$ respectively and Rademacher random
variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$, and then outputs



ft 



-sign(M t [i£])ej 



if a t - \M t [%]\ < - Wt signer]) + M t [f t }\ 



-sign(a t R t [jt] + M t [j^])ej* otherwise 



(19) 



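where $R_t = \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i z_i + M_t$, $j_t^* = \arg\max_{j \in [d]} |R_t[j]|$ and $i_t^* = \arg\max_{i \in [d]} |M_t[i]|$. The expected regret is bounded as:

$$\mathbb{E}[\mathrm{Reg}_T] \le C\, \mathbb{E}_{z_{1:T}} \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^{T} \epsilon_t z_t \right\|_\infty + 4 \sum_{t=1}^{T} P(\mathcal{E}_t^c) .$$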



5.2 Randomized Algorithm for the Simplex 

Given an algorithm for regret minimization over the probability simplex (as in the case of experts),
through a standard argument one also obtains an algorithm for the $\ell_1$ ball by doubling the number of
coordinates. We now show that the randomized method for the $\ell_1$ ball, developed in the previous
section, can be used to solve the problem over the probability simplex, a converse implication.
Specifically, we have the following corollary:

Corollary 11. For the distributions $D_1, \ldots, D_T$ satisfying Assumption 1, consider the randomized
strategy that at time $t$ draws $z_{t+1}, \ldots, z_T$ from $D_{t+1}, \ldots, D_T$ respectively and Rademacher random
variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$, and then outputs



ft = 



i}2o-t<M[f t ]-M t [i* t ] 
otherwise 



(20) 



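where $R_t = \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i z_i + M_t$, $j_t^* = \arg\min_{j \in [d]} R_t[j]$ and $i_t^* = \arg\min_{i \in [d]} M_t[i]$. The expected regret is bounded as:

$$\mathbb{E}[\mathrm{Reg}_T] \le C\, \mathbb{E}_{z_{1:T}} \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^{T} \epsilon_t z_t \right\|_\infty + 4 \sum_{t=1}^{T} P(\mathcal{E}_t^c)$$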



When the predictable sequence M t is zero, the algorithm reduces to ft - ej* with 



j t = argmax 



t-l T 

y^.xj — c €iZi 

i=l i=t+l 



which can be recognized as a Follow the Perturbed Leader type update with $\sum_{i=1}^{t-1} x_i$ being the
cumulative loss and $\sum_{i=t+1}^{T} \epsilon_i z_i$ being a random perturbation.
6 Other Examples



We now provide a couple of examples and sketch directions for further research. 



6.1 Delayed Feedback 



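As an example, consider the setting where the information given to the player at round $t$ consists of
two parts: the bandit feedback $\langle f_t, x_t \rangle$ about the cost of the chosen action, as well as full information
about the past move $x_{t-k}$. For $t > k$, let $M_t = M_t(I_1, \ldots, I_{t-1}) = \frac{1}{t-k-1} \sum_{s=1}^{t-k-1} x_s$. Then

$$\|M_t - M_t'\|^2 = \left\| \frac{1}{t-k-1} \sum_{s=1}^{t-k-1} x_s - \frac{1}{t-1} \sum_{s=1}^{t-1} x_s \right\|^2 = \left\| \frac{k}{(t-1)(t-k-1)} \sum_{s=1}^{t-k-1} x_s - \frac{1}{t-1} \sum_{s=t-k}^{t-1} x_s \right\|^2 \le \frac{4k^2}{(t-1)^2}$$

where $M_t' = \frac{1}{t-1} \sum_{s=1}^{t-1} x_s$ is the full information statistic. It is immediate from Lemma 4 that the
expected regret of the algorithm is

$$\mathbb{E}\left[\sum_{t=1}^T \langle f_t, x_t \rangle - \sum_{t=1}^T \langle f^*, x_t \rangle\right] \le \eta^{-1} \mathcal{R}(f^*) + 4 \eta n^2 \sum_{t=1}^{T} \mathbb{E}\left[\|x_t - M_t'\|^2\right] + 32 \eta n^2 k^2$$

This simple argument shows that variance-type bounds are immediate in bandit problems with
delayed full information feedback.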
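The bound $\|M_t - M_t'\|^2 \le 4k^2/(t-1)^2$ is easy to check numerically. The following sketch does so in one dimension, assuming scalar costs with $|x_s| \le 1$ and rounds $t \ge k+2$ (so that the delayed average is over at least one term); the helper names `delayed_mean`, `full_mean` and `max_gap_ratio` are hypothetical and only serve this illustration.

```python
def delayed_mean(xs, t, k):
    """M_t: average of x_1, ..., x_{t-k-1} (rounds are 1-indexed)."""
    n = t - k - 1
    return sum(xs[:n]) / n


def full_mean(xs, t):
    """M'_t: average of x_1, ..., x_{t-1}."""
    return sum(xs[:t - 1]) / (t - 1)


def max_gap_ratio(xs, k):
    """Largest ratio |M_t - M'_t|^2 / (4 k^2 / (t-1)^2) over valid rounds t."""
    worst = 0.0
    for t in range(k + 2, len(xs) + 2):
        gap = (delayed_mean(xs, t, k) - full_mean(xs, t)) ** 2
        worst = max(worst, gap / (4.0 * k * k / (t - 1) ** 2))
    return worst
```

A ratio of at most 1 confirms the bound on the given sequence. A sequence that switches sign exactly $k$ rounds before $t$ makes the ratio equal to 1, so the constant cannot be improved in general.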



6.2 I.I.D. Data 

Consider the case of an i.i.d. sequence $x_1, \ldots, x_T$ drawn from an unknown distribution with mean
$\mu \in \mathbb{R}^d$. Let us first discuss the full-information model. Consider the bound of either Lemma 1 or
Lemma 2 for $M_t = \frac{1}{t-1} \sum_{s=1}^{t-1} x_s$. For simplicity, let $\|\cdot\|$ be the Euclidean norm (the argument works
with any smooth norm). We may write

$$\|x_t - M_t\|^2 = \|x_t - \mu\|^2 + \|M_t - \mu\|^2 - 2 \langle x_t - \mu, M_t - \mu \rangle .$$

Taking the expectation over i.i.d. data, the first term in the above expression is the variance $\sigma^2$ of the
distribution under the given norm, while the third term disappears under the expectation because $x_t$ is independent of $M_t$. For the
second term, we perform exactly the same quadratic expansion and obtain

$$\mathbb{E}\|M_t - \mu\|^2 \le \frac{1}{(t-1)^2} \sum_{s=1}^{t-1} \mathbb{E}\|x_s - \mu\|^2 \le \frac{\sigma^2}{t-1}$$

and thus

$$\sum_{t=1}^{T} \mathbb{E}\|x_t - M_t\|^2 \le T \sigma^2 + \sigma^2 (\log T + 1) .$$

Coupled with the full-information results of Lemma 1 or Lemma 2, we obtain an $\mathcal{O}(\sigma \sqrt{T})$ bound
on regret, implying the natural transition from the noisy to deterministically predictable case as
the noise level goes to zero.
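The arithmetic behind the $T\sigma^2 + \sigma^2(\log T + 1)$ bound can be checked directly. The sketch below assumes the Euclidean case, where $\mathbb{E}\|x_t - M_t\|^2 = \sigma^2 + \sigma^2/(t-1)$ exactly for $t \ge 2$, and it leaves out round $t = 1$, for which the running average is undefined; the function name `predictable_error_bound` is hypothetical.

```python
import math


def predictable_error_bound(T, sigma2):
    """Compare the exact sum over t = 2..T of E||x_t - M_t||^2, where M_t
    averages t-1 i.i.d. samples with variance sigma2, against the bound
    T * sigma2 + sigma2 * (log T + 1)."""
    exact = sum(sigma2 + sigma2 / (t - 1) for t in range(2, T + 1))
    bound = T * sigma2 + sigma2 * (math.log(T) + 1.0)
    return exact, bound
```

The exact sum is $(T-1)\sigma^2 + \sigma^2 H_{T-1}$ with $H_{T-1}$ the harmonic number, which is dominated by the stated bound; both quantities vanish as $\sigma^2 \to 0$.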

The same argument works for the case of bandit information, given that $M_t$ can be constructed to
estimate $M_t'$ well (e.g. using the arguments of [10]).
7 Auxiliary Results: Improved Bounds for Small Losses



While the regret bound for the original SCRiBLe algorithm follows immediately from the more 
general Lemma 4, we now state an alternative bound for SCRiBLe in terms of the loss of the 
optimal decision. The bound holds under the assumption of positivity on the losses. Lemma 12 
is of independent interest and will be used as a building block for the analogous result for the 
multi-armed bandit in Lemma 13. Such bounds in terms of the loss of the best arm are attractive, 
as they give tighter results whenever the loss of the optimal decision is small. Thanks to this 
property, Lemma 13 is used in Section 4 in order to obtain bounds in terms of predictable process 
performance. 

Lemma 12. Consider the case when $\mathcal{R}$ is a self-concordant barrier over $\mathcal{F}$ and sets $\mathcal{F}$ and $\mathcal{X}$
are such that each $\langle f, x \rangle \in [0, s]$. Then for the SCRiBLe algorithm, for any choice of step size
$\eta < 1/(2sn^2)$, we have the bound



E 



T 
t=l 



We now state and prove a bound in terms of the loss of the best arm for the case of non-stochastic
multiarmed bandits. Such a bound is interesting in its own right and, to the best of our knowledge,
it does not appear in the literature.$^2$ Our approach is to use SCRiBLe with a self-concordant barrier
for the probability simplex, coupled with the bound of Lemma 12. (We were not able to make this
result work with the entropy function, even with the local norm bounds.)

Suppose that Nature plays a sequence $x_1, \ldots, x_T \in [0, s]^d$. On each round, we choose an arm $j_t$ and
observe $\langle e_{j_t}, x_t \rangle$.



SCRiBLe for multi-armed Bandit [3, 1] 

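Input: $\eta > 0$. Let $\mathcal{R}(f) = - \sum_{i=1}^{d-1} \log(f[i]) - \log\left(1 - \sum_{i=1}^{d-1} f[i]\right)$
Initialize $q_1$ with uniform distribution over arms. Let $h_1 = q_1[1 : d-1]$
At time $t = 1$ to $T$

Let $\{\Lambda_1, \ldots, \Lambda_{d-1}\}$ and $\{\lambda_1, \ldots, \lambda_{d-1}\}$ be the eigenvectors and eigenvalues of $\nabla^2 \mathcal{R}(h_t)$.
Choose $i_t$ uniformly at random from $\{1, \ldots, d-1\}$ and $\epsilon_t = \pm 1$ with probability 1/2.

Set $f_t = h_t + \epsilon_t \lambda_{i_t}^{-1/2} \Lambda_{i_t}$ and $q_t = \left(f_t, 1 - \sum_{i=1}^{d-1} f_t[i]\right)$.
Draw arm $j_t \sim q_t$ and suffer loss $\langle e_{j_t}, x_t \rangle$.

Define $\tilde{x}_t := d \left( \langle e_{j_t}, x_t \rangle \right) \epsilon_t \lambda_{i_t}^{1/2} \cdot \Lambda_{i_t}$.
Update

$$h_{t+1} = \arg\min_{h} \left\{ \eta \left\langle h, \sum_{s=1}^{t} \tilde{x}_s \right\rangle + \mathcal{R}(h) \right\}$$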



Lemma 13. Suppose $x_1, \ldots, x_T \in [0, s]^d$. For any $\eta < 1/(4sd^2)$ the expected regret of the SCRiBLe
for multi-armed Bandit algorithm is bounded as:

2 The bound of [4] is in terms of maximal gains, which is very different from a bound in terms of minimal loss. To 
the best of our knowledge, the trick of redefining losses as negative gains does not work here. 
8 Standard Doubling Trick



For completeness, we now describe a more or less standard doubling trick, extending it to the case of
partial information. Let $\mathcal{I}$ stand for some information space such that the algorithm receives $I_t \in \mathcal{I}$
at time $t$, as described in the introduction. Let $\Psi : \cup_s (\mathcal{I} \times \mathcal{F})^s \to \mathbb{R}$ be a (deterministic) function
defined for any contiguous time interval of any size $s \in [T]$. By the definition, $\Psi(I_r, \ldots, I_t, f_r, \ldots, f_t)$
is computable by the algorithm after the $t$-th step, for any $r \le t$. We make the following monotonicity
assumption on $\Psi$: for any $I_1, \ldots, I_t \in \mathcal{I}$ and any $f_1, \ldots, f_t \in \mathcal{F}$, $\Psi(I_{1:t-1}, f_{1:t-1}) \le \Psi(I_{1:t}, f_{1:t})$ and
$\Psi(I_{2:t}, f_{2:t}) \le \Psi(I_{1:t}, f_{1:t})$.

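Lemma 14. Suppose we have a randomized algorithm that takes a fixed $\eta$ as input and for some
constant $A$, without a priori knowledge of $\tau$, for any $\tau > 0$, guarantees expected regret of the form

$$\mathbb{E}\left[\sum_{t=1}^{\tau} \mathrm{loss}(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{\tau} \mathrm{loss}(f, x_t)\right] \le A \eta^{-1} + \eta\, \mathbb{E}\left[\Psi(I_{1:\tau}, f_{1:\tau})\right]$$

where $\Psi$ satisfies the above stated requirements. Then using this algorithm as a black-box for any
$T > 0$, we can provide a randomized algorithm with a regret bound

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathrm{loss}(f, x_t)\right] \le 16 \sqrt{A\, \mathbb{E}\left[\Psi(I_{1:T}, f_{1:T})\right]}$$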



Proof. The prediction problem is broken into phases, with a constant learning rate $\eta_i = \eta_0 2^{-i}$
throughout the $i$-th phase, for some $\eta_0 > 0$. Define for $i \ge 1$

$$s_{i+1} = \min\left\{ \tau : \eta_i \Psi(I_{s_i:\tau}, f_{s_i:\tau}) > A \eta_i^{-1} \right\}$$

to be the start of phase $i+1$, and $s_1 = 1$. Let $N$ be the last phase of the game and let $s_{N+1} = T + 1$.
Without loss of generality, assume $N > 1$ (for, otherwise regret is at most $4A/\eta_0$). Then



E 



T T 

Y^oss( ft,x t ) - inf Y loss (f> x t) 
t=i f&t=i 



< E 



< E 



Sfc+l-l Sfc+l-1 

Y \oss(f t ,x t ) - inf Y loss(/,x<) 

t=s k f^T t=s k 



N 

k=l J s k :s k+ 

N I ' 
Y\ A Vk 1+ m E [y(I Sk : Sk+1 -l,f Sk :s k+1 -l)] 
k=l \ fwk+l- 1 1 

- N 

<2E Y^Vk 1 
.k=i 

where the last inequality follows because rjk^(Is k : Sk+1 -i, fs k -.s k+1 -i) ^ Ar/^ 1 within each phase. Also 
observe that 

VN-l^(I SN ^:s N ,fs N ^:s N ) > 

which implies 



A 



A 
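by the monotonicity assumption. Hence, regret is upper bounded by

$$2 \sum_{k=1}^{N} A \eta_k^{-1} = 2 A \eta_0^{-1} 2^N \sum_{k=1}^{N} 2^{k-N} \le 4 A \eta_0^{-1} 2^N \le 8 \sqrt{A\, \Psi(I_{1:T}, f_{1:T})} .$$

Putting the arguments together,

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathrm{loss}(f, x_t)\right] \le 8\, \mathbb{E}\left[\sqrt{A\, \Psi(I_{1:T}, f_{1:T})}\right] \le 8 \sqrt{A\, \mathbb{E}\left[\Psi(I_{1:T}, f_{1:T})\right]}$$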



Now, observe that the rule for stopping the phase can only be calculated after the first time step
of the new phase. The easiest way to deal with this is to throw out $N$ time periods and suffer an
additional regret of $sN$ (losses are bounded by $s$). Using $\eta_0 = 4A/s$ this leads to an additional factor of
$sN \le s 2^N = 4 A \eta_0^{-1} 2^N \le 8 \sqrt{A\, \Psi(I_{1:T}, f_{1:T})}$, which is a gross over-bound. In conclusion, the overall
bound on regret is

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathrm{loss}(f, x_t)\right] \le 16 \sqrt{A\, \mathbb{E}\left[\Psi(I_{1:T}, f_{1:T})\right]}$$

□



We remark that while the algorithm may or may not start each new phase from a cold start (that
is, forget about what has been learned), the functions $M_t$ may still contain information about all
the past moves of Nature.

With this doubling trick, for any of the full information bounds presented in the paper (for instance
Lemmas 1, 2, 3 and 5) we can directly get an algorithm that enjoys a regret bound that is a factor
at most 8 from the bound with optimal choice of $\eta$.
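The phase structure of the doubling trick can be sketched in a few lines. The snippet below is a simplified illustration, not the exact procedure of the proof: it assumes that $\Psi$ is additive over rounds (for instance a sum of squares), it plays the round that triggers the switch with the old rate, and it restarts the phase statistic from zero afterwards. The function name `doubling_schedule` and its arguments are hypothetical.

```python
def doubling_schedule(psi_increments, A, eta0):
    """Return the learning rate used in each round.

    psi_increments[t] is the nonnegative growth of Psi contributed by round t.
    Phase i (i = 1, 2, ...) uses eta_i = eta0 * 2**(-i); the next phase starts
    once eta_i * Psi(current phase) exceeds A / eta_i.
    """
    etas = []
    phase = 1
    psi_phase = 0.0
    for inc in psi_increments:
        eta = eta0 * 2.0 ** (-phase)
        etas.append(eta)
        psi_phase += inc
        if eta * psi_phase > A / eta:
            # Stopping rule fired: halve the rate and restart the statistic.
            phase += 1
            psi_phase = 0.0
    return etas
```

With unit increments, `A = 1` and `eta0 = 1`, the phases have lengths 5, 17, 65, and so on, so each phase is roughly four times longer than the previous one while the rate halves.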

For Lemmas 4, 6, 7 and 8, we need to apply the doubling trick to an intermediate quantity, as
the final bound is given in terms of quantities not computable by the algorithm. Specifically, the
doubling trick needs to be applied to Equations (6), (8), (9) and (10), respectively, in order to get
bounds that are within a factor 8 from the bounds obtained by optimizing $\eta$ in the corresponding
equations. We can then upper bound these computable quantities by the corresponding unobserved quantities,
as is done in these lemmas. To see this more clearly, let us demonstrate this on the example of
Lemma 8. By Equation (10), we have that



E 



T T 

t=i t=i 



< if 1 K(f*) + 2T 1 n 2 E 



^{{fuXt-M^f 



t=i 



Now note that ((ft,xt - M^}) 2 is a quantity computable by the algorithm at each round. Also 
note that 2r\r? Y,T=i((fti x t ~ M^ 1 )) 2 satisfies the condition on required by Lemma 14, as the sum 
of squares is monotonic. Hence using the lemma we can conclude that 



E 



t=i 



< 16 



N 



2n 2 ft(/*)E 



u=i 



(21) 
The following steps in Lemma 8 (see proof in the Appendix) imply that



E 



Li=l 



< 2 IE 



m£Y,ht-Mt\ 

7reIIt=l 



+ 32|n| 3 iog(T|n|) 



Plugging the above in Equation 21 we can conclude that 



E 



E(/ t ,z*>-E(/V*> 

t=l t=l 



< 16 



\ 



An 2 H{f*) I E 



T 



inf £ fxt-Mfl 

_7reIIt = l 



+ 32|n| 3 io g (T|n|) 



This is exactly the inequality one would get if the final bound in Lemma 8 is optimized for $\eta$, with
an additional factor of 8. With a similar argument we can get the tight bounds for Lemmas 4, 6 and
7 too, even though they are in the bandit setting.
A Appendix



Proof of Lemma 1. Define gt+i = argminj e ^ 77 (/, X!s=i x s) +^(/) to be the (unmodified) Follow 
the Regularized Leader. Observe that for any /* e T, 

E (/t - /* = E (A - - M t) + E (/* - ft+i^t) + E - (22) 
t=i t=i t=i t=i 

We now prove by induction that 

E (ft - g t+ uM t ) + e (at+i.xt) < E + ^(D- 
t=i t=i t=i 

The base case r = 1 is immediate since M\ = 0. For the purposes of induction, suppose that the 
above inequality holds for r = T - 1. Using /* = and adding (fx - gT+i,Mx) + (<?t+i, £t) to both 
sides, 

T T T-l 

E (/* _ 9t+i,M t ) + E (gt+i,x t ) < E (/t,^> + V^Kifr) + </t - gT+i,M T ) + (g T +i,x T ) 
t=i t=i t=l 

^ (/t> E x * + + v'^Ut) - (gr+i,M T ) + (gT+i,x T ) 
< igr+i, E x * + + v'^igT+i) - (9t+i,M t ) + {gT+i,x T } 

by the optimality of /r and flr+i- This concludes the inductive argument, and from Eq. (22) we 
obtain 

E (ft - f* ,x t ) < E (/* - 9t + i. *t - M t ) + »f 1 W) (23) 
t=i t=i 

Define the Newton decrement for &t(f) - V (f, Es=i x « + -^t+i) + ^(/) as 

A(/,$ t ) := ||V$t(/)||} = [|V 2 * t (/)- 1 V* t (/)||/. 

Since 1Z is self-concordant then so is with their Hessians coinciding. The Newton decrement 
measures how far a point is from the global optimum. The following result can be found, for 
instance, in [ ! ■ '■]: For any self-concordant function 1Z, whenever \(f,lZ) < 1/2, we have 

||/-argmin^|| / <2A(/,^) 

where the local norm || ■ ||j is defined with respect to 7Z, i.e. \\g\\f ■- \J g T (V 2 1Z(f))g. Applying this 
to $t and using the fact that V$t-i(<7t+i) = i](M t - x t ), 

\\ft-gt+i\\f t = \\gt+i -argmin* t |/ t < 2A(# t+1) $ t ) = 2r}\\M t - x t \\} t . (24) 
Hence,

T 

z 

t=l t=l 



£ (ft -f*,xt)<t, - 9t + i\\t\\x t - M t ||* + rf ^(D 



<2 V Y l (\\x t -M t \\} t ) 2 + V - 1 K(n, 

t=i 

which proves the statement. □ 

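Proof of Lemma 2. For any $f^* \in \mathcal{F}$,

$$\langle f_t - f^*, x_t \rangle = \langle f_t - g_{t+1}, x_t - M_t \rangle + \langle f_t - g_{t+1}, M_t \rangle + \langle g_{t+1} - f^*, x_t \rangle \tag{25}$$

First observe that

$$\langle f_t - g_{t+1}, x_t - M_t \rangle \le \|f_t - g_{t+1}\| \, \|x_t - M_t\|_* \le \frac{\eta}{2} \|x_t - M_t\|_*^2 + \frac{1}{2\eta} \|f_t - g_{t+1}\|^2 . \tag{26}$$

On the other hand, any update of the form $a^* = \arg\min_{a \in \mathcal{A}} \langle a, x \rangle + D_{\mathcal{R}}(a, c)$ satisfies (see e.g. [6, 14]), for any $d \in \mathcal{A}$,

$$\langle a^* - d, x \rangle \le \langle d - a^*, \nabla \mathcal{R}(a^*) - \nabla \mathcal{R}(c) \rangle = D_{\mathcal{R}}(d, c) - D_{\mathcal{R}}(d, a^*) - D_{\mathcal{R}}(a^*, c) . \tag{27}$$

This yields

$$\langle f_t - g_{t+1}, M_t \rangle \le \frac{1}{\eta} \left( D_{\mathcal{R}}(g_{t+1}, g_t) - D_{\mathcal{R}}(g_{t+1}, f_t) - D_{\mathcal{R}}(f_t, g_t) \right) . \tag{28}$$

Next, note that by the form of update for $g_{t+1}$,

$$\langle g_{t+1} - f^*, x_t \rangle = \frac{1}{\eta} \langle g_{t+1} - f^*, \nabla \mathcal{R}(g_t) - \nabla \mathcal{R}(g_{t+1}) \rangle = \frac{1}{\eta} \left( D_{\mathcal{R}}(f^*, g_t) - D_{\mathcal{R}}(f^*, g_{t+1}) - D_{\mathcal{R}}(g_{t+1}, g_t) \right) , \tag{29}$$

and the same inequality holds by (27) if $g_{t+1}$ is defined as in (5) with a projection. Using Equations
(26), (29) and (28) in Equation (25) we conclude that

$$\langle f_t - f^*, x_t \rangle \le \frac{\eta}{2} \|x_t - M_t\|_*^2 + \frac{1}{2\eta} \|f_t - g_{t+1}\|^2 + \frac{1}{\eta} \left( D_{\mathcal{R}}(g_{t+1}, g_t) - D_{\mathcal{R}}(g_{t+1}, f_t) - D_{\mathcal{R}}(f_t, g_t) \right) + \frac{1}{\eta} \left( D_{\mathcal{R}}(f^*, g_t) - D_{\mathcal{R}}(f^*, g_{t+1}) - D_{\mathcal{R}}(g_{t+1}, g_t) \right)$$

$$\le \frac{\eta}{2} \|x_t - M_t\|_*^2 + \frac{1}{2\eta} \|f_t - g_{t+1}\|^2 + \frac{1}{\eta} \left( D_{\mathcal{R}}(f^*, g_t) - D_{\mathcal{R}}(f^*, g_{t+1}) - D_{\mathcal{R}}(g_{t+1}, f_t) \right)$$

By strong convexity of $\mathcal{R}$, $D_{\mathcal{R}}(g_{t+1}, f_t) \ge \frac{1}{2} \|g_{t+1} - f_t\|^2$ and thus

$$\langle f_t - f^*, x_t \rangle \le \frac{\eta}{2} \|x_t - M_t\|_*^2 + \frac{1}{\eta} \left( D_{\mathcal{R}}(f^*, g_t) - D_{\mathcal{R}}(f^*, g_{t+1}) \right)$$

Summing over $t = 1, \ldots, T$ yields, for any $f^* \in \mathcal{F}$,

$$\sum_{t=1}^{T} \langle f_t - f^*, x_t \rangle \le \frac{\eta}{2} \sum_{t=1}^{T} \|x_t - M_t\|_*^2 + \frac{R_{\max}^2}{\eta}$$

where $R_{\max}^2 = \max_{f \in \mathcal{F}} \mathcal{R}(f) - \min_{f \in \mathcal{F}} \mathcal{R}(f)$. □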

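Proof of Lemma 3. The proof closely follows the proof of Lemma 2, together with the technique of [2]. For the purposes of analysis, let $g_{t+1}$ be a projected point at every step (that is,
normalized). Then we have the closed form solution for $f_t$ and $g_{t+1}$:

$$g_{t+1}(i) \propto \exp\left\{ -\eta \sum_{s=1}^{t} x_s(i) \right\} \quad \text{and} \quad f_t(i) \propto \exp\left\{ -\eta \sum_{s=1}^{t-1} x_s(i) - \eta M_t(i) \right\}$$

Hence,

$$\frac{g_{t+1}(i)}{f_t(i)} = \frac{\exp\{-\eta \sum_{s=1}^{t} x_s(i)\}}{\exp\{-\eta \sum_{s=1}^{t-1} x_s(i) - \eta M_t(i)\}} \cdot \frac{\sum_{j=1}^{d} \exp\{-\eta \sum_{s=1}^{t-1} x_s(j) - \eta M_t(j)\}}{\sum_{j=1}^{d} \exp\{-\eta \sum_{s=1}^{t} x_s(j)\}}$$

$$= \exp\{-\eta (x_t(i) - M_t(i))\} \cdot \frac{\sum_{j=1}^{d} \exp\{-\eta \sum_{s=1}^{t-1} x_s(j) - \eta M_t(j)\}}{\sum_{j=1}^{d} \exp\{-\eta \sum_{s=1}^{t-1} x_s(j) - \eta M_t(j)\} \exp\{-\eta (x_t(j) - M_t(j))\}}$$

$$= \frac{\exp\{-\eta (x_t(i) - M_t(i))\}}{\sum_{j=1}^{d} f_t(j) \exp\{-\eta (x_t(j) - M_t(j))\}} \tag{30}$$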
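Identity (30) can be confirmed numerically with a small script. The sketch below builds the two normalized exponential-weights distributions from their closed forms and compares their ratio with the right-hand side of (30); the function names and the sample numbers in the usage note are illustrative assumptions, not part of the paper.

```python
import math


def normalize(weights):
    """Scale nonnegative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def optimistic_exp_weights(past_losses, m_t, eta):
    """f_t(i) proportional to exp(-eta * (sum_{s<t} x_s(i) + M_t(i)))."""
    d = len(m_t)
    cum = [sum(x[i] for x in past_losses) for i in range(d)]
    return normalize([math.exp(-eta * (cum[i] + m_t[i])) for i in range(d)])


def exp_weights(losses, eta, d):
    """g_{t+1}(i) proportional to exp(-eta * sum_{s<=t} x_s(i))."""
    cum = [sum(x[i] for x in losses) for i in range(d)]
    return normalize([math.exp(-eta * cum[i]) for i in range(d)])


def ratio_via_eq30(f_t, x_t, m_t, eta):
    """Right-hand side of (30): the ratio g_{t+1}(i) / f_t(i) for each i."""
    tilt = [math.exp(-eta * (x_t[i] - m_t[i])) for i in range(len(f_t))]
    denom = sum(f * w for f, w in zip(f_t, tilt))
    return [w / denom for w in tilt]
```

For example, with two past loss vectors, a new loss `x_t` and a prediction `m_t`, the list `[g / f for g, f in zip(exp_weights(past + [x_t], eta, d), optimistic_exp_weights(past, m_t, eta))]` agrees with `ratio_via_eq30(...)` up to floating-point error.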
For any f* e T, 

(ft - f\x t ) = (ft-g t+ i,x t - M t ) + (ft-gt + i,M t ) + (g t+ i - f*,x t ) (31) 
First observe that 

(ft-gt + i,Xt-M t )<\\f t -g t+1 \\ t \\x t -M t \\* t . (32) 
Now, since V 2 7£ is diagonal, 

II /t - guilt = E(/t(<) - »n(<)) a //t(<) = -i + E /«(0G&n(0//*(0) a 

i=l i=l 
using the fact that both / t and <?t+i are probability distributions. In view of (30), 



\\f t -g t+1 f t =-l + E 



( exp{-Z} V 
\Eexp{-Z} j 



where Z is defined as a random variable taking on values r](xt(i) - Mt(i)) with probability /t(i). 
Then, if almost surely EX - X < a/2, 



E 
since the function $(e^y - y - 1)/y^2$ is nondecreasing over reals. As long as $|\eta(x_t(i) - M_t(i))| \le 1/4$, we
can guarantee that $\mathbb{E}Z - Z \le 1/2$, yielding



\\f t -g t+1 \\ t <2VEZ2 = 2^ 
Combining with (32), we have 



N j=i 



E Mi)(vMi) - M t (i))) 2 = H\xt - Mid* 



(/ t - 5t+1 ,x t -M 4 )<2r ? (||x t -M t ||;) 2 . (33) 



The rest similar to the proof of Lemma 2. We have 



and 



(/ t -fft +1 ,M t )<i( J D^( 5t+ i jfft )-D 7 j( fft+ i,/ t )-^(/i, 5t )) . (34) 
77 



(gt+1 ~ f*,x t ) < - (D n (f*,g t )-D n (f*,g t+1 )-D n (g t+1 ,g t )) , (35) 
V 



We conclude that 



(ft-r,x t ) < 2rj(\\x t -M t \\lf 



+ - (Dn(gt+i,gt) - D n (g t+1 , f t ) - D n (f t , g t )) 

7] 

+ - (Dn(r,gt)-n n (t,g t+1 )-D n (g t+1 ,g t ))) 

7] 

< 2r,(\\x t - M t \\l) 2 + - (D n (r,g t ) - D n (f*,g t+1 ) - D n (g t+l ,f t )) 
V 



Summing over t = 1, . . . , T yields, for any /* e T, 



T T 



E(/t-A^>^2r ? E(lk-Mt|in 2 + 1 ^ 
t=i t=i V 



Proof of Lemma 4- In view of Lemma 1, for any /* e T 

T T T 



E fa,*) - e < v-'nn + - M t \tf 

i=l t=l i=l 



2L „ _/„ ,,n „*X2 



^r ? - 1 ^(r) + 27 ? ^n 2 ((/ t ,^-Mi)) 2 (| ei A- t /2 A Jt |') 

t=i 

<r ? - 1 ^(r) + 27 ? ^n 2 ((/ t ,^-M t )) 



t=i 

r 



<r ? - 1 ^(/*) + 2??n 2 ^||x i -M ( 



2 

t=l 
□



where for simplicity we use the Euclidean norm and use the assumption $\|f_t\| \le 1$; any primal-dual
pair of norms will work here. It is easy to verify that $\tilde{x}_t$ is an unbiased estimate of $x_t$ and $\mathbb{E}[f_t] = h_t$.
Thus, by the standard argument and the above upper bound,



E 



T T 

-£(/Vt) 

t=l i=l 



= E 

= E 



T T 

t=l t=l 
T T 



t=l 



t=l 

T 



< v^Kif*) + 2r, £ n 2 E [((/ t , x t - M t )f] 
t=i 

T 



t=i 



The second statement follows immediately. 



□ 



Proof of Lemma 5. First note that by Lemma 2 we have that for the Mt chosen in the algorithm, 

E (ft,*) - E </*.**> < v-'rL, + ? E N - 
t=i *=i z t=i 



<^rL x + ?E E^mn-m; 

Z t=l7ren 



Til 2 



(Jensen's Inequality) 



2 



\e- 1/ Ventti 



x t -Mf|| 2 + io g |n| 



where the last step is due to Corollary 2.3 of [7]. Indeed, the updates for the $q_t$'s are exactly the experts
algorithm with pointwise loss at each round $t$ for expert $\pi \in \Pi$ given by $\|M_t^{\pi} - x_t\|_*^2$. Also, as each
$M_t^{\pi} \in \mathcal{X}$, the unit ball of the dual norm, we can conclude that $\|M_t^{\pi} - x_t\|_*^2 \le 4$, which is why we have a
scaling by factor 4. Simplifying leads to the bound in the lemma. □

Proof of Lemma 6. In view of Lemma 1, for any f*tT 



E (h t ,x t ) - e (f^t) < rr x nn + 2^E(ii^ - M tU) 2 

t=\ t=l t=l 

= rf^U* ) + 2r? E n\(f t ,x t - M t )f (\e t \f aJ|* f 



<r l - 1 K{f) + 2r ] n 2 Y J {UuX t -M t )f 



t=l 



It is easy to verify that $\tilde{x}_t$ is an unbiased estimate of $x_t$ and $\mathbb{E}[f_t] = h_t$. Thus, by the standard
argument and the above upper bound we get,



E 



E </*.**> -E (/*>**> 
t=i t=i 



j=i t=i 

T 

-1 



<T]- 1 TZ(f*) + 2i 1 n z E 



Y,((ft,x t -M t )f 
Lt=l 



This proves the first inequality of the Lemma. Now by Jensen's inequality, the above bound can 
be simplified as: 



E 



T T 
i=l t=l 



< ri- l K(f*) + 2r]n 2 E 

< r]- l TZ{f*) + 2r]n 2 E 



Y,((ft,xt-M t )) 
t=i 



< rj^nn + 8 V n 2 (-?-) (e inf E«/ t ,x t " M t)f + log|n| 

Ve-l/\ TrelTftl 



where the last step is due to Corollary 2.3 of [7]. Indeed, the updates for the $q_t$'s are exactly the experts
algorithm with pointwise loss at each round $t$ for expert $\pi \in \Pi$ given by $\left( \langle f_t, x_t - M_t^{\pi} \rangle \right)^2$. Also, as
each $M_t^{\pi} \in \mathcal{X}$, the unit ball of the dual norm, we can conclude that $\left( \langle f_t, x_t - M_t^{\pi} \rangle \right)^2 \le 4$, which is
why we have a scaling by factor 4. Further, since $\|f_t\| \le 1$ we can conclude that:



E 



T T 

t=i t=i 



< ri- l TZ{f*) + Srjn 2 (-?— ) (e inf V ||jc t - Mff + log |n|) 

V e - 1 / \ ttl ' / 

< ri- l TZ{f*) + 137?n 2 (e inf £ |x t - M^f + log |n|) . 

\ 7ren i= l / 



This concludes the proof. 



□ 



Proof of Lemma 7. First note that by Lemma 2, since M^ 1 is the predictable process we use, 
we have deterministically that, 

E (ft, xt) - E U\x t ) < v - 1 rI^ + \ E II** - M T II* 
t=i t=\ z f=i 

Hence we can conclude that expected regret is bounded as : 



E 



T T 

t=i *=i 



en- m ; 

U=l 



TTj II 2 



(36) 



This proves the first inequality in the lemma. However, note that the update for the $q_t$'s uses the
SCRiBLe for multi-armed bandit algorithm, where the pointwise loss for any $\pi \in \Pi$ at round $t$ is given
by $\|x_t - M_t^{\pi}\|_*^2$. Also note that the maximal value of the loss is bounded by $\max_{M_t, x_t} \|x_t - M_t\|_*^2 \le 4$. Hence,
using Lemma 13 with $s = 4$ and step size $1/(32|\Pi|^2)$, we conclude that



E 



n t \\2 
t II * 



t=l 



< 2 inf £ \\x t - M[\\l + 64|n| 3 log(T|n|) 



Using this in Equation (36) we obtain 
E 



T T 

t=i t=i 



finfE 

\7rent=l 



^-Mf||2 + 32|n| 3 log(T|n|) 



Proof of Lemma 8. In view of Lemma 1, for any f*tT 

XT T 



t=l 



t=l 



E to, 5*> - E </*,**) * »r 1 w) + 217 E( i^t - M r i?) a 

= rf 1 ^/* ) + 2r, E n 2 «/ t>a * - M-)) 2 (J^f A it Q" 



f=i 

< r^W*) + 2»jn 2 E«/^t - M^)) 2 
We can bound expected regret of the algorithm as: 
E 



T T 
t=l t=l 



T T 

= E E il:t-l,Tl:t - E (/*> 

t=l t=l 

= E E ^iJ(^^)]-E E [(/*» £ *>] 

E f)</H,^)-i; 
t=l t=l 

T 



< 7f 1 K(f*) + 2i ] n 2 E 



Y,((ft,xt-M?)) 2 
t=i 



□ 



(37) 



This gives the first inequality of the Lemma. However, note that the update for the $q_t$'s, the distributions
over the set $\Pi$, is obtained by running the SCRiBLe for multi-armed bandit algorithm, where the pointwise
loss for any $\pi \in \Pi$ at round $t$ is given by $\left( \langle f_t, x_t - M_t^{\pi} \rangle \right)^2$. Also note that the maximal value of the loss is
bounded by 4. Hence, using Lemma 13 with $s = 4$ and step size $1/(32|\Pi|^2)$, we conclude by the regret
bound in that lemma that



E 



t=l 



< 2E 



inf £«/t,*t - M?)) 2 + 64|n| 3 log(T|n|) 

_7r€lli=l 



Plugging this back in Equation (37) we conclude that 



E [Reg T ] < r]- x n{f*) + 4r]n 2 IE 

< ri^Kif*) + Arjn 2 (e inf £ 



inf Y,((ft,x t -M?}) 2 

_7rent=l 



\x t -MT 



+ 32|n| 3 log(T|n|)J 

+ 32|n| 3 iog(T|n|) 
□

Proof of Lemma 9. To show admissibility using the particular randomized strategy qt given in 
the lemma, we need to show that 

sup {E f „ qt f T x t + Rel T (r\xi, ...,x t )}< Rel T (F\xx, x t -i) 

The distribution q t is defined by first drawing zt+i ~ D t+ i, ...,zr~ Dt and et+i, ■ ■ ■ £t Rademacher 
random variables, and then calculating f t = ft(z t+ i:T, et+i-.r) as in (17). Hence, 



sup {E/^g t / T x t + Rel r (J r |xi,...,x t )} = sup 

X t <iCt(xi;t-l) XtiCt{xi: t -l) 



E fjx t + E 

e t + l:T H+V.T 
Z t+1:T Z t+1:T 



T t 

c ^ €iZi — Xj 

i=t+l i=l 



< E sup \fjx t + 



T t 
i=t+l i=l 



Now, with defined as 



ft = argmin sup -{ (5, x t ) + 

fiie^ 7 a; t 6C t (xi : t-i) 



T t 
i=t+l i=l 



for any given Zt+i:T, Ct+i--T, we have 

r T t 

sup |/ t T x t + C Y, f-iZi-^x, 

xteCt(x 1]t -i) { i=t+l i=l 



= inf sup \ g T xt + 

gtF x t eCt(xv.t-i) 



T t 

c ^ €iZ{ — Xj 

i=t+l i=l 



We can conclude that for this choice of qt, 



sup I E [/ T £t] + Rely (F\xi, . . . , xt) \ < E inf sup I g T xt + 

x t eCt(xi, t -i) [f~qt J e ^l9^Fx^Ct{x 1 ..t-l) 



= E inf sup 



E 



^+i;£ffe^peA(Ct(xi:t-i))zt~pL 



g J x t + 



Z t + 1:T 
T 



T t 

C ^ tiZi — ^ Xi 

i-t+1 i=l 



i=t+l i=l 



= E sup inf] E [ 5 T xt] +E Xt . p 



«t+X|T p6 A(C?t(xi ! t_i))a6y [xt~p 



= E sup 



E [a*] 

x t ~p 



+ ^xt~p 



c ^ f-iZi — Xj 

i=t+l i=l 
T t 

C ^ t%Zi — x, 

i=t+l j=l 



In the next to last step we appealed to the minimax theorem, which holds by linearity of the expression
in $g$ and the fact that $\mathcal{F}$ is a compact convex set; furthermore, the term in the expectation is linear
in $p$. By triangle inequality,



IE [x t ] 




x t ~p 





T t 
C Y, EiZi — Y, Xj 
i=t+l i=l 



< E 



x t ~p 



T t-1 

C Y € i Z i ~ X! Xi + E ~ Xt 

i=t+l i=l ^-P 

T t-1 

- "^.a^p C XI e * Z * ~ X/ g * ~*~ X t ~ x < 
i=t+l i=l 

T t-1 

C ^ ejZj - Y x i + e t(x' t -x t ) 



^Xt,x' t ~'[}^et 



i=t+l 



i=l 



where we introduced a Rademacher random variable $\epsilon_t$ via the standard symmetrization argument.
We now introduce "centering" by $M_t(x_{1:t-1})$. The above expression is equal to



Hence, 



^XtjX'.-p^tt 



T t-1 

C Y, e i z i ~ Y, x i + e <( x t ~ M t (x 1:t -i)) + e t {M t {xi:t-i) - x t ) 
l=f+l 1=1 
T t-1 

C Y e i Z i - £ X i + 2e t( x t ~ M t (x lvt -i)) 
i=t+l i=l 



E sup 

^>A(C t (z 1;t _i)) 



= E sup E 2t . p E et 

e t +i-.T peA(C) 



E [^] 



+ E^j^p 



T t 

C Y, € i z i ~ Yi Xi 
i=t+l i=l 



T t-1 

C Y e i Z i ~ Yj Xi + 2€tZt 
i=t+l i=l 



where in the last step we pass to the set of distributions on $C = \{z : \|z\|_* \le \sigma_t\}$. By Assumption 1,
the last expression is upper bounded by



E E^ Dt E e 

«t + l:T 
Z t+1:T 



T t-1 

C Y e i Z i ~Y, X i + Ce t Z t 
i=t+l i=l 



= Rel T (^"|xi,...,x 4 _i) 



□ 



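Lemma 15. Consider the case when $\mathcal{X}$ is the $\ell_\infty$ unit ball and $\mathcal{F}$ is the $\ell_1$ unit ball. Let $R_t$ be
any random vector and define $j_t^* = \operatorname{argmax}_{j} |R_t[j]|$. Let

$$
f_t(R_t) = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\{ \sigma_t \sum_{i \ne j_t^*} |f[i]| + \sigma_t f[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) + \langle f, M_t \rangle \Big\},
$$

where $M_t$ is any fixed vector in $\mathbb{R}^N$. Then

$$
\mathbb{E}_{R_t} \Big[ \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f_t(R_t), z + M_t \rangle + \|R_t + z\|_\infty \big\} \Big]
\le \mathbb{E}_{R_t} \Big[ \inf_{f \in \mathcal{F}} \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\} \Big] + 4\, P(\mathcal{E}_t^c),
$$

where $\mathcal{E}_t$ is the event that the largest two coordinates of $R_t$ are separated by at least $4\sigma_t$.

Proof of Lemma 15. For any given vectors $R_t, M_t \in \mathbb{R}^d$ and any $f \in \mathcal{F}$,

$$
\sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\}
= \sup_{z \in \{-1,1\}^d} \big\{ \sigma_t \langle f, z \rangle + \|R_t + \sigma_t z\|_\infty \big\} + \langle f, M_t \rangle .
$$

Leaving out the $\langle f, M_t \rangle$ term, we can further rewrite the above supremum as

$$
\sup_{z \in \{-1,1\}^d} \Big\{ \sigma_t \sum_{i=1}^{d} f[i]\, z[i] + \max_{j \in [d]} |R_t[j] + \sigma_t z[j]| \Big\}
= \max_{j \in [d]} \sup_{z \in \{-1,1\}^d} \Big\{ \sigma_t \sum_{i=1}^{d} f[i]\, z[i] + |R_t[j] + \sigma_t z[j]| \Big\} .
$$

By optimizing over coordinates $i \ne j$, this is equal to

$$
\begin{aligned}
&\max_{j \in [d]} \Big\{ \sigma_t \sum_{i \ne j} |f[i]| + \max\big\{ |R_t[j] + \sigma_t| + \sigma_t f[j],\ |R_t[j] - \sigma_t| - \sigma_t f[j] \big\} \Big\} \\
&\qquad = \sigma_t \|f\|_1 + \max_{j \in [d]} \Big\{ -\sigma_t |f[j]| + \max\big\{ |R_t[j] + \sigma_t| + \sigma_t f[j],\ |R_t[j] - \sigma_t| - \sigma_t f[j] \big\} \Big\} .
\end{aligned}
$$

Under the event $\mathcal{E}_t$, the maximum over $j$ will be achieved at $j_t^*$, thus yielding

$$
= \sigma_t \|f\|_1 + |R_t[j_t^*]| + \sigma_t - 2 \sigma_t |f[j_t^*]|\, \mathbf{1}\{ \mathrm{sign}(f[j_t^*]) \ne \mathrm{sign}(R_t[j_t^*]) \},
$$

while outside of $\mathcal{E}_t$ the above solution can be off by at most 4. We may also write the above
expression as

$$
|R_t[j_t^*]| + \sigma_t + \sigma_t \sum_{i \ne j_t^*} |f[i]| + \sigma_t f[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) .
$$

So, under the event $\mathcal{E}_t$, the minimum is attained at

$$
f_t(R_t) = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\{ \sigma_t \sum_{i \ne j_t^*} |f[i]| + \sigma_t f[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) + \langle f, M_t \rangle \Big\}
$$

and so

$$
\sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f_t(R_t), z + M_t \rangle + \|R_t + z\|_\infty \big\}
= \inf_{f \in \mathcal{F}} \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\} .
$$

On the other hand, on the event $\mathcal{E}_t^c$,

$$
\sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f_t(R_t), z + M_t \rangle + \|R_t + z\|_\infty \big\}
- \inf_{f \in \mathcal{F}} \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\} \le 4
$$

and so

$$
\sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f_t(R_t), z + M_t \rangle + \|R_t + z\|_\infty \big\}
\le \inf_{f \in \mathcal{F}} \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\} + 4 \cdot \mathbf{1}\{\mathcal{E}_t^c\} .
$$

Taking expectation proves the result. □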
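The coordinate-wise reduction used in the proof of Lemma 15 can be checked numerically. The sketch below is only an illustration: the function names, dimensions and sampling ranges are assumptions introduced here. It compares a brute-force supremum over sign vectors $z \in \{-1,1\}^d$ with the closed form $\sigma_t \|f\|_1 + \max_j \{\cdot\}$.

```python
import itertools
import random


def sup_bruteforce(f, R, sigma):
    """Brute-force sup over z in {-1, 1}^d of
    sigma * <f, z> + max_j |R[j] + sigma * z[j]|."""
    d = len(f)
    best = float("-inf")
    for z in itertools.product((-1.0, 1.0), repeat=d):
        linear = sigma * sum(fi * zi for fi, zi in zip(f, z))
        sup_norm = max(abs(R[j] + sigma * z[j]) for j in range(d))
        best = max(best, linear + sup_norm)
    return best


def sup_closed_form(f, R, sigma):
    """Closed form: sigma * ||f||_1 + max_j of
    -sigma*|f[j]| + max(|R[j]+sigma| + sigma*f[j], |R[j]-sigma| - sigma*f[j])."""
    l1 = sum(abs(fi) for fi in f)
    per_coord = [
        -sigma * abs(f[j])
        + max(abs(R[j] + sigma) + sigma * f[j], abs(R[j] - sigma) - sigma * f[j])
        for j in range(len(f))
    ]
    return sigma * l1 + max(per_coord)


def max_gap(trials=200, d=4, seed=0):
    """Largest absolute difference between the two computations on random inputs."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        f = [rng.uniform(-1, 1) for _ in range(d)]
        R = [rng.uniform(-5, 5) for _ in range(d)]
        sigma = rng.uniform(0.1, 2.0)
        diff = abs(sup_bruteforce(f, R, sigma) - sup_closed_form(f, R, sigma))
        gap = max(gap, diff)
    return gap
```

The two computations should agree up to floating-point error for any $f$ and $R_t$; the event $\mathcal{E}_t$ only matters for the later step that identifies the maximizing coordinate with $j_t^*$.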
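Proof of Theorem 10. From Lemma 9 we have that the randomized strategy which, at time $t$,
draws $z_{t+1}, \ldots, z_T$ from $D_{t+1}, \ldots, D_T$ respectively and Rademacher random variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$,
and then picks

$$
f_t = \operatorname*{argmin}_{g \in \mathcal{F}} \sup_{x_t \in C_t(x_{1:t-1})} \Big\{ \langle g, x_t \rangle + \Big\| C \sum_{i=t+1}^{T} \epsilon_i z_i - \sum_{i=1}^{t-1} x_i - x_t \Big\|_\infty \Big\}
$$

is admissible w.r.t. the relaxation

$$
\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathbb{E}_{z_{t+1} \sim D_{t+1}, \ldots, z_T \sim D_T}\, \mathbb{E}_{\epsilon} \Big\| C \sum_{i=t+1}^{T} \epsilon_i z_i - \sum_{i=1}^{t} x_i \Big\|_\infty .
$$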



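However, by Lemma 15, for the randomized algorithm that at time $t$ draws $z_{t+1}, \ldots, z_T$
from $D_{t+1}, \ldots, D_T$ respectively and Rademacher random variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$, and then picks

$$
f_t(R_t) = \operatorname*{argmin}_{f : \|f\|_1 \le 1} \Big\{ \sigma_t \sum_{i \ne j_t^*} |f[i]| + \sigma_t f[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) + \langle f, M_t \rangle \Big\} \qquad (38)
$$

we have that

$$
\mathbb{E}_{R_t} \Big[ \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f_t(R_t), z + M_t \rangle + \|R_t + z\|_\infty \big\} \Big]
\le \mathbb{E}_{R_t} \Big[ \inf_{f \in \mathcal{F}} \sup_{z : \|z\|_\infty \le \sigma_t} \big\{ \langle f, z + M_t \rangle + \|R_t + z\|_\infty \big\} \Big] + 4\, P(\mathcal{E}_t^c) .
$$

Hence we can conclude that the randomized strategy that at time $t$ draws $z_{t+1}, \ldots, z_T$ from
$D_{t+1}, \ldots, D_T$ respectively and Rademacher random variables $\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$, and then picks
$f_t(R_t) = \operatorname{argmin}_{f : \|f\|_1 \le 1} \{ \sigma_t \sum_{i \ne j_t^*} |f[i]| + \sigma_t f[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) + \langle f, M_t \rangle \}$, is admissible w.r.t. the relaxation

$$
\mathrm{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathbb{E}_{z_{t+1} \sim D_{t+1}, \ldots, z_T \sim D_T}\, \mathbb{E}_{\epsilon} \Big\| C \sum_{i=t+1}^{T} \epsilon_i z_i - \sum_{i=1}^{t} x_i \Big\|_\infty + 4 \sum_{i=t+1}^{T} P(\mathcal{E}_i^c) . \qquad (39)
$$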



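Hence, as mentioned in Equation (12), we can conclude that the expected regret of the randomized
strategy that plays $f_t(R_t)$ on round $t$ is bounded as

$$
\mathbb{E}[\mathrm{Reg}_T] \le C\, \mathbb{E}_{z_{1:T}}\, \mathbb{E}_{\epsilon} \Big\| \sum_{t=1}^{T} \epsilon_t z_t \Big\|_\infty + 4 \sum_{t=1}^{T} P(\mathcal{E}_t^c) .
$$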



Now we claim that the update in Equation (38) is the same as the one in Equation (19) given in the
theorem statement, and so the above regret bound holds for the update provided in the theorem.
To prove this, we first show that the $f_t(R_t)$ given in Equation (38) is a vertex of the $\ell_1$ ball. To
see this, note that we can rewrite the minimization as

$$
\operatorname*{argmin}_{s \in \{\pm 1\}^d}\ \operatorname*{argmin}_{g :\, \forall i \in [d],\, g[i] \ge 0,\ \sum_{i} g[i] \le 1} \Big\{ \sigma_t \sum_{i \ne j_t^*} g[i] + \sigma_t s[j_t^*]\, g[j_t^*]\, \mathrm{sign}(R_t[j_t^*]) + \sum_{i=1}^{d} s[i]\, g[i]\, M_t[i] \Big\}
$$

and $f_t(R_t) = (s[1] g[1], \ldots, s[d] g[d])$. That is, the vector $s$ is the sign vector, $\mathrm{sign}(f_t(R_t))$, and the vector $g$
is the magnitude vector, $|f_t|$. Further note that, given $s \in \{\pm 1\}^d$, the minimization problem in terms
of $g$ is linear in $g$. Hence the solution will be at a vertex of the set $\{ g : \forall i \in [d],\ g[i] \ge 0,\ \sum_{i=1}^{d} g[i] \le 1 \}$,
as it is a linear optimization problem. Hence either $g = 0$ or $g = e_i$ for some $i \in [d]$. However, the
solution is clearly not $f_t(R_t) = 0$, as the minimum has to be negative unless $M_t$ and $R_t$ are
both 0. Thus we see that $g = e_i$ for some $i$, and so $f_t(R_t)$ is of the form $s[i] e_i$, which is a vertex
of the $\ell_1$ ball. Hence we conclude that the update in Equation (18) can be rewritten as $f_t(R_t) = s_t e_{i_t}$,
where

$$
(i_t, s_t) = \operatorname*{argmin}_{i \in [d],\ s \in \{\pm 1\}} \Big\{ \sigma_t \mathbf{1}\{ i \ne j_t^* \} + \sigma_t\, s\, \mathbf{1}\{ i = j_t^* \}\, \mathrm{sign}(R_t[j_t^*]) + s\, M_t[i] \Big\} .
$$

Let $i_t^* = \operatorname{argmax}_{i \in [d]} |M_t[i]|$. It is easy to see that $f_t(R_t) = s_t e_{i_t}$ is given as follows:

$$
f_t(R_t) =
\begin{cases}
-\mathrm{sign}(M_t[i_t^*])\, e_{i_t^*} & \text{if } \sigma_t - |M_t[i_t^*]| \le -\big| \sigma_t\, \mathrm{sign}(R_t[j_t^*]) + M_t[j_t^*] \big| \\
-\mathrm{sign}(\sigma_t R_t[j_t^*] + M_t[j_t^*])\, e_{j_t^*} & \text{otherwise}
\end{cases}
$$
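The vertex argument can be illustrated with a small sketch. It does not implement the closed-form case distinction; instead it minimizes the objective of Equation (38) directly over the $2d$ vertices $s\, e_i$ of the $\ell_1$ ball. All function names and the random test point generator are assumptions made for this illustration.

```python
import random


def objective(f, R, M, sigma):
    """Objective of update (38): sigma * sum_{i != j*} |f[i]|
    + sigma * f[j*] * sign(R[j*]) + <f, M>, with j* = argmax_j |R[j]|."""
    d = len(f)
    j_star = max(range(d), key=lambda j: abs(R[j]))
    sgn = 1.0 if R[j_star] >= 0 else -1.0
    off = sum(abs(f[i]) for i in range(d) if i != j_star)
    inner = sum(fi * mi for fi, mi in zip(f, M))
    return sigma * off + sigma * f[j_star] * sgn + inner


def vertex_update(R, M, sigma):
    """Minimise the objective over the 2d vertices s * e_i of the l1 ball."""
    d = len(R)
    best_val, best_f = float("inf"), None
    for i in range(d):
        for s in (-1.0, 1.0):
            f = [0.0] * d
            f[i] = s
            val = objective(f, R, M, sigma)
            if val < best_val:
                best_val, best_f = val, f
    return best_f, best_val


def random_l1_point(rng, d):
    """A random point with l1 norm at most 1 (not uniform; for checking only)."""
    raw = [rng.uniform(-1, 1) for _ in range(d)]
    scale = rng.uniform(0, 1) / max(sum(abs(r) for r in raw), 1e-12)
    return [r * scale for r in raw]
```

On random inputs, the vertex minimizer should never be beaten by an interior point of the $\ell_1$ ball, which is what the linearity-in-$g$ argument predicts.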



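Hence we have shown that the update in Equation (19) is admissible w.r.t. the relaxation in
Equation (39) and so enjoys the expected regret bound

$$
\mathbb{E}[\mathrm{Reg}_T] \le C\, \mathbb{E}_{z_{1:T}}\, \mathbb{E}_{\epsilon} \Big\| \sum_{t=1}^{T} \epsilon_t z_t \Big\|_\infty + 4 \sum_{t=1}^{T} P(\mathcal{E}_t^c),
$$

thus proving the theorem.

□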



Proof of Corollary 11. For the case when $\mathcal{F}$ is the simplex, every $f \in \mathcal{F}$ satisfies $f[i] \ge 0$ for each
$i \in [d]$ and $\sum_{i} f[i] = 1$, so $\langle f, B\mathbf{1} \rangle = B$. Hence, if we add an arbitrary number $B$ to each
coordinate of $x_t \in [-1, 1]^d$, the regret remains unchanged, that is,

$$
\sum_{t=1}^{T} \langle f_t, x_t \rangle - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \langle f, x_t \rangle
= \sum_{t=1}^{T} \langle f_t, x_t + B\mathbf{1} \rangle - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \langle f, x_t + B\mathbf{1} \rangle
$$

where $\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^d$. Hence, let us consider adding to each coordinate of every $x_t$ a large
negative constant $B < 0$ (for instance, think of $B < -e^T$ or smaller), and set $\tilde{x}_t = x_t + B\mathbf{1}$ and $\tilde{M}_t = M_t + B\mathbf{1}$.
Notice that with the predictable process given by $\tilde{M}_t$ and with the adversary playing $\tilde{x}_t$ we still have that
$\|\tilde{x}_t - \tilde{M}_t\| = \|z_t\| \le \sigma_t$. We now claim that the algorithm for the $\ell_1$ ball from the previous section
operating on the $\tilde{x}_t$'s has the following properties: it (a) produces solutions within the simplex, (b) does
not require knowledge of $B$, and (c) attains a regret bound that does not depend on $B$. We will
also show that this solution is the one given in Equation (20) of the Corollary statement.
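The shift invariance of the regret on the simplex is easy to confirm numerically. The following sketch uses hypothetical helper names and arbitrary random data; it only illustrates that adding $B$ to every coordinate of every loss vector changes the learner's loss and the comparator's loss by the same amount $BT$.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def regret_on_simplex(plays, losses):
    """Learner's cumulative loss minus that of the best fixed vertex e_i.
    For a linear loss the infimum over the simplex is attained at a vertex."""
    d = len(losses[0])
    learner = sum(dot(f, x) for f, x in zip(plays, losses))
    best = min(sum(x[i] for x in losses) for i in range(d))
    return learner - best


def shifted(losses, B):
    """Add the constant B to every coordinate of every loss vector."""
    return [[xi + B for xi in x] for x in losses]


def random_simplex_point(rng, d):
    """A random probability vector of length d."""
    w = [rng.random() + 1e-9 for _ in range(d)]
    total = sum(w)
    return [wi / total for wi in w]
```

The invariance relies on each play summing to one; it would fail for plays taken from the full $\ell_1$ ball.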

Let us first begin by noting that when we look at the linear game on the input sequence $\tilde{x}_1, \ldots, \tilde{x}_T$,
even when we take $\mathcal{F}$ to be all of the $\ell_1$ ball, the comparator will in fact be in the positive orthant.
To see this, note that since $x_t \in [-1, 1]^d$, each $\tilde{x}_t$ is in the negative orthant. Hence,

$$
\inf_{i \in [d]} \sum_{t=1}^{T} \langle e_i, \tilde{x}_t \rangle = -\Big\| \sum_{t=1}^{T} \tilde{x}_t \Big\|_\infty = \inf_{f : \|f\|_1 \le 1} \sum_{t=1}^{T} \langle f, \tilde{x}_t \rangle .
$$

If we further show that each $f_t$ picked by the algorithm in Theorem 10 is also in the simplex, then we
effectively show that the algorithm from the previous section can be adapted to play on the simplex
by simply adding this large negative number to each coordinate of the $x_t$'s. Further, the randomized
algorithm also enjoys the same regret bound provided in the previous section, and since the regret bound
only depended on the magnitude of $z_t = \tilde{x}_t - \tilde{M}_t = x_t - M_t$, we can conclude that the regret bound only
depends on $\sigma_t$ and is independent of $B$.

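Notice that $\operatorname{argmax}_{i \in [d]} |\tilde{M}_t[i]| = \operatorname{argmax}_{i \in [d]} -\tilde{M}_t[i] = \operatorname{argmax}_{i \in [d]} (-M_t[i] - B) = \operatorname{argmin}_{i \in [d]} M_t[i] = i_t^*$. Similarly
we also have that $\operatorname{argmax}_{i \in [d]} |\tilde{R}_t[i]| = j_t^*$, where $\tilde{R}_t = \sum_{i=1}^{t-1} \tilde{x}_i - C \sum_{i=t+1}^{T} \epsilon_i z_i + \tilde{M}_t$. Now note that the
algorithm of the previous section for the game where the adversary plays $\tilde{x}_t$ is given by

$$
f_t =
\begin{cases}
-\mathrm{sign}(\tilde{M}_t[i_t^*])\, e_{i_t^*} & \text{if } \sigma_t - |\tilde{M}_t[i_t^*]| \le -\big| \sigma_t\, \mathrm{sign}(\tilde{R}_t[j_t^*]) + \tilde{M}_t[j_t^*] \big| \\
-\mathrm{sign}(\sigma_t \tilde{R}_t[j_t^*] + \tilde{M}_t[j_t^*])\, e_{j_t^*} & \text{otherwise}
\end{cases}
$$

Since $B$ is a very large negative constant, we have that $\mathrm{sign}(\tilde{R}_t[j_t^*]) = \mathrm{sign}(\tilde{M}_t[j_t^*]) = -1$ and
that $|\tilde{M}_t[i_t^*]| = -\tilde{M}_t[i_t^*] = -M_t[i_t^*] - B$ and similarly, $|\sigma_t\, \mathrm{sign}(\tilde{R}_t[j_t^*]) + \tilde{M}_t[j_t^*]| = \sigma_t - \tilde{M}_t[j_t^*] =
\sigma_t - M_t[j_t^*] - B$. Therefore, we can rewrite the $f_t$'s as

$$
f_t =
\begin{cases}
e_{i_t^*} & \text{if } 2\sigma_t \le M_t[j_t^*] - M_t[i_t^*] \\
e_{j_t^*} & \text{otherwise}
\end{cases}
$$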
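As a sanity check on this reduction, the sketch below compares two computations on random inputs: a brute-force minimization of the $\ell_1$-ball vertex objective on the shifted quantities $\tilde{M}_t = M_t + B\mathbf{1}$ and $\tilde{R}_t = R_t + tB\mathbf{1}$, and the simplified two-case rule. The function names are hypothetical, and reading $j_t^*$ as $\operatorname{argmin}_j R_t[j]$ in the large-$|B|$ limit is an assumption of this sketch, since the statement of Equation (20) is not reproduced here.

```python
import random


def l1_vertex_update(R, M, sigma):
    """Brute-force minimiser over vertices s * e_i of the l1 ball of
    sigma*1{i != j*} + sigma*s*1{i == j*}*sign(R[j*]) + s*M[i],
    where j* = argmax_j |R[j]|. Returns the pair (i, s)."""
    d = len(R)
    j_star = max(range(d), key=lambda j: abs(R[j]))
    sgn = 1.0 if R[j_star] >= 0 else -1.0
    best = None
    for i in range(d):
        for s in (-1.0, 1.0):
            cost = (sigma if i != j_star else sigma * s * sgn) + s * M[i]
            if best is None or cost < best[0]:
                best = (cost, i, s)
    return best[1], best[2]


def simplex_update(R, M, sigma):
    """Simplified rule: play e_{i*} if 2*sigma <= M[j*] - M[i*], else e_{j*},
    with i* = argmin M and j* = argmin R (assumed limit of argmax |R + t*B|)."""
    d = len(R)
    i_star = min(range(d), key=lambda i: M[i])
    j_star = min(range(d), key=lambda j: R[j])
    return i_star if 2 * sigma <= M[j_star] - M[i_star] else j_star


def shifted_update(R, M, sigma, t, B):
    """Run the l1-ball rule on the shifted quantities M + B and R + t*B."""
    R_shift = [r + t * B for r in R]
    M_shift = [m + B for m in M]
    return l1_vertex_update(R_shift, M_shift, sigma)
```

With $B$ large and negative, the brute-force rule should always return a positive sign and the same coordinate as the simplified rule, so the play lies in the simplex and does not depend on $B$.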



We conclude that the randomized algorithm for the $\ell_1/\ell_\infty$ case from the previous section on the
sequence given by $\tilde{x}_t$ produces $f_t$'s in the simplex. Further, the regret of the algorithm on the sequence
$\tilde{x}_1, \ldots, \tilde{x}_T$ is the same as its regret on $x_1, \ldots, x_T$, and this regret is bounded as

$$
\mathbb{E}[\mathrm{Reg}_T] \le C\, \mathbb{E}_{z_{1:T}}\, \mathbb{E}_{\epsilon} \Big\| \sum_{t=1}^{T} \epsilon_t z_t \Big\|_\infty + 4 \sum_{t=1}^{T} P(\mathcal{E}_t^c) .
$$

This concludes the proof of the corollary. Notice that throughout we assumed $B$ is a negative
constant with large enough magnitude so that for any $t$, $\mathrm{sign}(\tilde{R}_t[j_t^*]) = -1$ (or at least this is true with
very high probability). However, since neither the result nor the final algorithm depends on $B$,
we can simply take $B$ to have magnitude tending to $\infty$ so that $\mathrm{sign}(\tilde{R}_t[j_t^*]) = -1$ almost surely.

□



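Proof of Lemma 12. In view of Lemma 1, for any $f^* \in \mathcal{F}$,

$$
\begin{aligned}
\sum_{t=1}^{T} \langle h_t, \tilde{x}_t \rangle - \sum_{t=1}^{T} \langle f^*, \tilde{x}_t \rangle
&\le \eta^{-1} \mathcal{R}(f^*) + 2\eta \sum_{t=1}^{T} \big( \|\tilde{x}_t\|_t^* \big)^2 \\
&= \eta^{-1} \mathcal{R}(f^*) + 2\eta\, n^2 \sum_{t=1}^{T} \big( \langle f_t, x_t \rangle \big)^2 \\
&\le \eta^{-1} \mathcal{R}(f^*) + 2 s \eta\, n^2 \sum_{t=1}^{T} \langle f_t, x_t \rangle .
\end{aligned}
$$

It is easy to verify that $\tilde{x}_t$ is an unbiased estimate of $x_t$ and $\mathbb{E}[f_t] = h_t$. Thus,

$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{t=1}^{T} \langle f_t, x_t \rangle - \sum_{t=1}^{T} \langle f^*, x_t \rangle \Big]
&= \mathbb{E}\Big[ \sum_{t=1}^{T} \langle h_t, x_t \rangle - \sum_{t=1}^{T} \langle f^*, x_t \rangle \Big] \\
&= \mathbb{E}\Big[ \sum_{t=1}^{T} \langle h_t, \tilde{x}_t \rangle - \sum_{t=1}^{T} \langle f^*, \tilde{x}_t \rangle \Big] \\
&\le \eta^{-1} \mathcal{R}(f^*) + 2 s \eta\, n^2\, \mathbb{E}\Big[ \sum_{t=1}^{T} \langle f_t, x_t \rangle \Big] .
\end{aligned}
$$

Rearranging, we can conclude that

$$
\mathbb{E}\Big[ \sum_{t=1}^{T} \langle f_t, x_t \rangle \Big] \le \frac{1}{1 - 2 s \eta\, n^2} \Big( \sum_{t=1}^{T} \langle f^*, x_t \rangle + \eta^{-1} \mathcal{R}(f^*) \Big) .
$$

□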



Proof of Lemma 13. We are interested in solving the multi-armed bandit problem using the
self-concordant barrier method so that we can get a regret bound in terms of the loss of the optimal
arm. We do this in two steps. First, we provide an algorithm for the linear bandit problem over the
simplex, that is, for the case when on each round the learner plays $q_t \in \Delta([d])$, the
adversary plays a loss vector $x_t \in [0, s]^d$, and the learner observes $\langle q_t, x_t \rangle$ at the end of the round. Next,
we show that this bandit algorithm over the simplex can be converted into a multi-armed bandit
algorithm. To this end, let us first develop a linear bandit algorithm over the simplex based on the
self-concordant barrier algorithm (SCRiBLe).



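Bandit algorithm over simplex: Note that one can rewrite the loss of any $q \in \Delta([d])$ over any
$x \in [0, s]^d$ as

$$
\begin{aligned}
\langle q, x \rangle &= \langle q[1:d-1],\ x[1:d-1] \rangle + \big( 1 - \langle q[1:d-1], \mathbf{1} \rangle \big)\, x[d] \\
&= \langle q[1:d-1],\ x[1:d-1] - \mathbf{1}\, x[d] \rangle + x[d] \\
&= \big\langle (q[1:d-1],\ 1),\ (x[1:d-1] - \mathbf{1}\, x[d],\ x[d]) \big\rangle .
\end{aligned}
$$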
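The lifting identity above is straightforward to check numerically. In this sketch, `lift_play` and `lift_loss` are hypothetical names for the two maps $q \mapsto (q[1:d-1], 1)$ and $x \mapsto (x[1:d-1] - \mathbf{1}x[d],\ x[d])$.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lift_play(q):
    """Map q in the simplex to (q[1:d-1], 1)."""
    return list(q[:-1]) + [1.0]


def lift_loss(x):
    """Map a loss vector x to (x[1:d-1] - 1 * x[d], x[d])."""
    return [xi - x[-1] for xi in x[:-1]] + [x[-1]]


def lifting_gap(q, x):
    """Absolute difference between <q, x> and the lifted inner product."""
    return abs(dot(q, x) - dot(lift_play(q), lift_loss(x)))
```

The identity uses only that the coordinates of $q$ sum to one, so the last coordinate of $q$ is determined by the first $d-1$.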

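Since the above holds for any distribution $q$ over the $d$ arms and any loss vector $x$, we see that
solving the linear bandit problem where the learner picks from the simplex and the adversary picks from $[0, s]^d$
is equivalent to the linear bandit game where the learner picks vectors from the set $\mathcal{F}'$ and the adversary picks
vectors from the set $\mathcal{X}'$, where

$$
\mathcal{F}' = \Big\{ (f, 1) : f \in \mathbb{R}^{d-1} \ \text{s.t.}\ \forall i \in [d-1],\ f[i] \ge 0,\ \sum_{i=1}^{d-1} f[i] \le 1 \Big\}
$$

and $\mathcal{X}' = \{ (x[1:d-1] - \mathbf{1}\, x[d],\ x[d]) : x \in \mathcal{X} \}$. Now we claim that the function
$\tilde{\mathcal{R}}(f) = -\sum_{i=1}^{d-1} \log(f[i]) - \log\big(1 - \sum_{i=1}^{d-1} f[i]\big)$ is a self-concordant barrier of the set $\mathcal{F}'$. To see this, first note
that the function $\mathcal{R}(f[1:d-1]) = -\sum_{i=1}^{d-1} \log(f[i]) - \log\big(1 - \sum_{i=1}^{d-1} f[i]\big)$ is a self-concordant barrier on the set
$\{ f \in \mathbb{R}^{d-1} : \forall i \in [d-1],\ f[i] \ge 0,\ \sum_{i=1}^{d-1} f[i] \le 1 \}$. Now, since the function $\tilde{\mathcal{R}}$ is simply the same as the
function $\mathcal{R}$ applied only to the first $d-1$ coordinates of the input, it is easy to see that $\tilde{\mathcal{R}}$ is a
self-concordant barrier on $\mathcal{F}'$. Hence, using Lemma 12, we can conclude that for the SCRiBLe algorithm
with this reduction, with any choice of $\eta > 0$ and any $q^* \in \Delta([d])$,

$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{t=1}^{T} \langle q_t, x_t \rangle \Big]
&\le \frac{1}{1 - (2 s d^2)\eta} \Big( \sum_{t=1}^{T} \langle q^*, x_t \rangle + \eta^{-1} \mathcal{R}(q^*[1:d-1]) \Big) \\
&\le \frac{1}{1 - (2 s d^2)\eta} \Big( \inf_{q \in \Delta([d])} \sum_{t=1}^{T} \langle q, x_t \rangle + 1 + d\, \eta^{-1} \log(dT) \Big) \\
&= \frac{1}{1 - (2 s d^2)\eta} \Big( \inf_{j \in [d]} \sum_{t=1}^{T} \langle e_j, x_t \rangle + 1 + d\, \eta^{-1} \log(dT) \Big) \qquad (40)
\end{aligned}
$$

where the bound is obtained by picking $q^* = (1 - 1/T)\, e_{j^*} + \sum_{i \ne j^*} \frac{1}{(d-1)T}\, e_i$ with $j^* = \operatorname{argmin}_{j \in [d]} \sum_{t=1}^{T} \langle e_j, x_t \rangle$.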

Thus we have a linear bandit algorithm over the simplex with the bound given in Equation (40). 
Now we claim that this algorithm can be used for solving multi-armed bandit problem. 
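To see where the $d \log(dT)$ term can come from, one can evaluate the log barrier at the comparator $q^* = (1 - 1/T)\, e_{j^*} + \sum_{i \ne j^*} e_i / ((d-1)T)$. The sketch below uses hypothetical function names and assumes $d \ge 2$ and $T \ge 2$. It checks that $q^*$ is a probability vector and that the barrier value stays below $d \log(dT)$ for a few small cases.

```python
import math


def comparator(d, T, j_star):
    """q* = (1 - 1/T) e_{j*} + sum_{i != j*} e_i / ((d - 1) T); assumes d >= 2, T >= 2."""
    q = [1.0 / ((d - 1) * T)] * d
    q[j_star] = 1.0 - 1.0 / T
    return q


def log_barrier(q):
    """-sum_{i<d} log q[i] - log(1 - sum_{i<d} q[i]), evaluated on the first
    d - 1 coordinates; on the simplex this equals -sum_i log q[i]."""
    head = q[:-1]
    return -sum(math.log(v) for v in head) - math.log(1.0 - sum(head))


def barrier_budget(d, T):
    """The quantity d * log(d * T) appearing in Equation (40)."""
    return d * math.log(d * T)
```

Moving mass $1/T$ away from the best arm keeps $q^*$ in the interior of the simplex, where the barrier is finite, at the cost of a bounded additive increase in the comparator's loss.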



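Using linear bandit algorithm over simplex for multi-armed bandit problem: We claim
that the algorithm we have developed for the simplex case can be used for the multi-armed bandit
problem. To see this, note first that for any choice of $q_1, \ldots, q_T \in \Delta([d])$ and any choice of $x_1, \ldots, x_T$,

$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{t=1}^{T} \langle q_t, x_t \rangle \Big] - \inf_{q \in \Delta([d])} \sum_{t=1}^{T} \langle q, x_t \rangle
&= \mathbb{E}\Big[ \sum_{t=1}^{T} \mathbb{E}_{j_t \sim q_t} \big[ \langle e_{j_t}, x_t \rangle \big] \Big] - \inf_{i \in [d]} \sum_{t=1}^{T} \langle e_i, x_t \rangle \\
&= \mathbb{E}\Big[ \sum_{t=1}^{T} \langle e_{j_t}, x_t \rangle - \inf_{i \in [d]} \sum_{t=1}^{T} \langle e_i, x_t \rangle \Big] .
\end{aligned}
$$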



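Hence this shows that if we have an algorithm that outputs $q_1, \ldots, q_T$, then on each round, by
sampling the arm to pick from $q_t$, we get the same regret bound. However, note that to run a
bandit algorithm over the simplex we needed to be able to observe $\langle q_t, x_t \rangle$, while in reality we
only observe $\langle e_{j_t}, x_t \rangle$. There is an easy remedy for this. Note that we needed to observe $\langle q_t, x_t \rangle$
only to produce the unbiased estimate $\tilde{x}_t := d\, \langle q_t, x_t \rangle\, \epsilon_t\, \lambda_{j_t}^{1/2} \cdot \Lambda_{j_t}$. However, $d\, \langle q_t, x_t \rangle\, \epsilon_t\, \lambda_{j_t}^{1/2} \cdot \Lambda_{j_t} =
\mathbb{E}_{i_t \sim q_t} \big[ d\, \langle e_{i_t}, x_t \rangle\, \epsilon_t\, \lambda_{j_t}^{1/2} \cdot \Lambda_{j_t} \big]$. Hence, $d\, \langle e_{j_t}, x_t \rangle\, \epsilon_t\, \lambda_{j_t}^{1/2} \cdot \Lambda_{j_t}$ is also an unbiased estimate of $x_t$, and
so the algorithm can simply use $\langle e_{j_t}, x_t \rangle$ to build the estimates while enjoying the same bound in
expectation. Thus, SCRiBLe for multi-armed bandit enjoys the bound

$$
\mathbb{E}\Big[ \sum_{t=1}^{T} \langle e_{j_t}, x_t \rangle \Big] \le \frac{1}{1 - (2 s d^2)\eta} \Big( \inf_{j \in [d]} \sum_{t=1}^{T} \langle e_j, x_t \rangle + 1 + d\, \eta^{-1} \log(dT) \Big),
$$

which concludes the proof.

□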
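The sampling step in the proof of Lemma 13 rests on the identity $\mathbb{E}_{j \sim q} \langle e_j, x \rangle = \langle q, x \rangle$. A minimal Monte Carlo sketch, with a hypothetical function name and arbitrary example values for $q$ and $x$, illustrates it:

```python
import random


def sampled_average_loss(q, x, n_samples, seed=0):
    """Average of x[j] over arms j drawn independently from the distribution q."""
    rng = random.Random(seed)
    arms = rng.choices(range(len(q)), weights=q, k=n_samples)
    return sum(x[j] for j in arms) / n_samples


def expected_loss(q, x):
    """The linear loss <q, x> that the simplex algorithm would have observed."""
    return sum(qj * xj for qj, xj in zip(q, x))
```

With enough samples, the empirical average of the observed single-arm losses approaches $\langle q, x \rangle$, which is why replacing $\langle q_t, x_t \rangle$ by $\langle e_{j_t}, x_t \rangle$ leaves the estimate unbiased.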